From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
Cc: | Andreas Karlsson <andreas(at)proxel(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: Patch to address concerns about ICU collcollate stability in v10 (Was: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?) |
Date: | 2017-09-26 20:03:35 |
Message-ID: | CAH2-WzmVtRyNg2gT=u=ktEC-jM3aKq4bYzJ0u2=OsxE+O3KQ4g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sun, Sep 24, 2017 at 9:24 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> * Documents the aforementioned keyword collation attribute restriction
> on ICU versions before ICU 54. This was needed anyway. We only claim
> for Postgres collations what the ICU docs claim for ICU collators,
> even though there is reason to believe that some ICU versions before
> ICU 54 actually can do better.
Following some digging, it looks like the restriction actually only
needs to apply on versions prior to ICU 4.6. The internal function
_canonicalize() contains the functionality need for Postgres to avoid
converting to BCP 47 itself on the fly (this was previously believed
to be the case only in ICU 54):
https://ssl.icu-project.org/trac/browser/tags/release-50-1-2/icu4c/source/common/uloc.cpp#L1614
(This is uloc.c on earlier ICU versions.)
(The call stack here is that from ucol_open(), ucol_open_internal() is
called, which calls uloc_getBaseName(), which calls _canonicalize().)
ICU gained that _hasBCP47Extension() branch in ICU 46. The
_canonicalize function has hardly changed since, despite the rewrite
from C to C++ (I checked up until ICU 55, the reasonably current
version I've done most testing on). This is totally consistent with
what both Peter and I have observed, even though the ucol_open() docs
suggest that that stuff only became available within ICU 54. I think
that ucol_open() was updated to mention BCP 47 at the same time as it
mentioned the lifting of the restrictions on extended keywords (when
you no longer had to use ucol_setAttribute()), confusing two basically
unrelated issues.
I suggest that as a compromise, my proposal can be changed in either
of the following ways:
* Do the same thing as I propose, except only introduce the mapping
from/to language tags when ICU version < 46 is in use. This would mean
that we'd be doing that far less in practice.
Or:
* Never do *any* mapping/language tag conversion when opening a
collator, but still consistently canonicalize as BCP 47 on all ICU
versions. This can be done by only supporting versions that have the
required _hasBCP47Extension() branch (ICU >= 46). We'd desupport the
old ICU 42 and ICU 44 versions.
Support for ICU 42 and ICU 44 came fairly late, in commit eccead9 on
August 5th. Both Tom and I expressed reservations about that at the
time. It doesn't seem too late to desupport those 2 versions, leaving
us with a fix that is even less invasive. We actually wouldn't
technically be changing anything about canonicalization at all, this
way. We'd be changing our understanding of when a restriction applies
(not < 54, < 46), at which point the "collcollate =
U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name" thing that I take
particular issue with becomes "collcollate = U_ICU_VERSION_MAJOR_NUM
>= 46 ? langtag : name".
This is equivalent to "collcollate = langtag" (which is what it would
actually look like), since we're desupporting earlier versions.
--
Peter Geoghegan
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2017-09-26 20:07:02 | Re: [BUGS] BUG #14825: enum type: unsafe use? |
Previous Message | Jeff Janes | 2017-09-26 20:00:21 | Re: v10 pg_ctl compatibility |