Patch to address concerns about ICU collcollate stability in v10 (Was: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Andreas Karlsson <andreas(at)proxel(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Patch to address concerns about ICU collcollate stability in v10 (Was: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?)
Date: 2017-09-25 04:24:43
Message-ID: CAH2-WzksxL=rbdVgvCQSruav5ssn+=WBbRGzbchAH2w8WxNyQg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Sep 22, 2017 at 11:34 AM, Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> After reviewing this thread and testing around a bit, I think the code
> is mostly fine as it is, but the documentation is lacking. Hence,
> attached is a patch to expand the documentation quite a bit, especially
> to document in more detail what ICU locale strings are accepted.

Attached is my counterproposal. This patch has us always make sure
that collcollate is in BCP 47 format, even on ICU versions prior to
ICU 54.

Other improvements/bug fixes:

* Sanitizes locale names within CREATE COLLATION. This has been tested
on ICU 42 (earliest supported version) and ICU 55.

* Enforces a NAMEDATALEN restriction for collcollate during CREATE
COLLATION, forbidding collcollate truncation. This is useful because
truncating can allow subtly wrong answers later on.

* Adds DEBUG1 message with ICU display name, so we can satisfy
ourselves that we're getting the expected behavior, to some degree.

I used this to confirm that we get consistent behavior between ICU 42
and 55 for CREATE COLLATION. On ICU 42, keyword collation attributes
(e.g., the emoji keyword, numeric ordering for natural sort order)
still don't work, just as before, but the locale string is still
considered valid. (This is because ucol_setAttribute() is supposed to
be used there).

* Documents the aforementioned keyword collation attribute restriction
on ICU versions before ICU 54. This was needed anyway. We only claim
for Postgres collations what the ICU docs claim for ICU collators,
even though there is reason to believe that some ICU versions before
ICU 54 actually can do better.

* When using ICU 4.2, the examples in the docs (variant collations
like German Phonebook order) now actually work.

The examples are completely broken right now IMV, because the user has
to know that they are on ICU 4.2, which only accepts the old style
locale strings as things stand. And, they'd have no obvious indication
that things were broken without this patch, because there would have
been no sanitization or other feedback.

* Creates root collation as root-x-icu (collcollate "root"), not
und-x-icu. "und" means undefined language.

* Moves all knowledge of ICU version issues to within a few
pg_locale.c routines, leaving the code reasonably well encapsulated.

* Does an encoding conversion when getting a display name for the
initdb collation comment. This needs to be ascii-safe due to the
initdb/template0 DB encoding restriction, but I suspect that the way
we do it now is subtly wrong. This does imply that SQL_ASCII databases
will never get ICU pg_collation entries that come with comments added
by initdb, but such databases were already unable to use the
collations anyway, so this is no loss at all. (SQL_ASCII is mentioned
here as an example of a database collation that ICU doesn't support,
all of which are affected in this way.)

I decided to implement CREATE COLLATION sanitization based on whether
or not a display string can be generated (if not, or if it's empty, we
reject). This seems like about the right level of strictness to me,
because we're still very forgiving. I admit that that's a little bit
arbitrary, but it does seem to be a good match for Postgres; it is
forgiving of things that could plausibly make sense on another ICU
version to some user at some time, but rejects most things that are
inherently wrong, IMHO. You can still ask for Japanese as spoken in
Madagascar, or even specify a language that ICU has never heard of,
and there is no error. It catches syntax errors only. See the slightly
expanded tests for details. I'm very open to negotiating the exact
details of how we sanitize, but any level of sanitization will be at
least a little bit arbitrary (including what we have right now, which
is no sanitization).

Aside from the specific problems for Postgres that I've noted that the
patch prevents or fixes, there is another reason to do this. The old
style locale name format is officially deprecated by ICU, which makes
it seem like we should never expose it to users in the first place.
Per ICU docs:

"Starting with ICU 54, the following naming scheme and its API
functions are deprecated. Use ucol_open() with language tag collation
keywords instead (see Collation API Details)" [1]

[1] http://userguide.icu-project.org/collation/concepts#TOC-Collator-naming-scheme
--
Peter Geoghegan

Attachment Content-Type Size
0001-Consistently-canonicalize-ICU-collations-collcollate.patch text/x-patch 21.5 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2017-09-25 04:25:24 Re: Page Scan Mode in Hash Index
Previous Message Andres Freund 2017-09-25 04:09:19 Re: What's with all the fflush(stderr) calls in pg_standby.c?