Re: Patch to address concerns about ICU collcollate stability in v10 (Was: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Andreas Karlsson <andreas(at)proxel(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Patch to address concerns about ICU collcollate stability in v10 (Was: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?)
Date: 2017-09-26 20:03:35
Message-ID: CAH2-WzmVtRyNg2gT=u=ktEC-jM3aKq4bYzJ0u2=OsxE+O3KQ4g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Sep 24, 2017 at 9:24 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> * Documents the aforementioned keyword collation attribute restriction
> on ICU versions before ICU 54. This was needed anyway. We only claim
> for Postgres collations what the ICU docs claim for ICU collators,
> even though there is reason to believe that some ICU versions before
> ICU 54 actually can do better.

Following some digging, it looks like the restriction actually only
needs to apply on versions prior to ICU 4.6. The internal function
_canonicalize() contains the functionality need for Postgres to avoid
converting to BCP 47 itself on the fly (this was previously believed
to be the case only in ICU 54):

https://ssl.icu-project.org/trac/browser/tags/release-50-1-2/icu4c/source/common/uloc.cpp#L1614

(This is uloc.c on earlier ICU versions.)

(The call stack here is that from ucol_open(), ucol_open_internal() is
called, which calls uloc_getBaseName(), which calls _canonicalize().)

ICU gained that _hasBCP47Extension() branch in ICU 46. The
_canonicalize function has hardly changed since, despite the rewrite
from C to C++ (I checked up until ICU 55, the reasonably current
version I've done most testing on). This is totally consistent with
what both Peter and I have observed, even though the ucol_open() docs
suggest that that stuff only became available within ICU 54. I think
that ucol_open() was updated to mention BCP 47 at the same time as it
mentioned the lifting of the restrictions on extended keywords (when
you no longer had to use ucol_setAttribute()), confusing two basically
unrelated issues.

I suggest that as a compromise, my proposal can be changed in either
of the following ways:

* Do the same thing as I propose, except only introduce the mapping
from/to language tags when ICU version < 46 is in use. This would mean
that we'd be doing that far less in practice.

Or:

* Never do *any* mapping/language tag conversion when opening a
collator, but still consistently canonicalize as BCP 47 on all ICU
versions. This can be done by only supporting versions that have the
required _hasBCP47Extension() branch (ICU >= 46). We'd desupport the
old ICU 42 and ICU 44 versions.

Support for ICU 42 and ICU 44 came fairly late, in commit eccead9 on
August 5th. Both Tom and I expressed reservations about that at the
time. It doesn't seem too late to desupport those 2 versions, leaving
us with a fix that is even less invasive. We actually wouldn't
technically be changing anything about canonicalization at all, this
way. We'd be changing our understanding of when a restriction applies
(not < 54, < 46), at which point the "collcollate =
U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name" thing that I take
particular issue with becomes "collcollate = U_ICU_VERSION_MAJOR_NUM
>= 46 ? langtag : name".

This is equivalent to "collcollate = langtag" (which is what it would
actually look like), since we're desupporting earlier versions.

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-09-26 20:07:02 Re: [BUGS] BUG #14825: enum type: unsafe use?
Previous Message Jeff Janes 2017-09-26 20:00:21 Re: v10 pg_ctl compatibility