Re: ICU integration

From: Doug Doole <ddoole(at)salesforce(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ICU integration
Date: 2016-09-06 17:40:02
Message-ID: CAP6UvaNTvYXmsJTiDQKwKTBJhO-axRXBjWTXJ5oAonSFNJ514g@mail.gmail.com
Lists: pgsql-hackers

> The ICU ABI (not API) is also versioned. The way that this is done is
> that all functions are actually macros to a versioned symbol. So
> ucol_open() is actually a macro that expands to, say, ucol_open_57() in
> ICU version 57. (They also got rid of a dot in their versions a while
> ago.) It's basically hand-crafted symbol versioning. That way, you can
> link with multiple versions of ICU at the same time. However, the
> purpose of that, as I understand it, is so that plugins can have a
> different version of ICU loaded than the main process or another plugin.
> In terms of postgres using the right version of ICU, it doesn't buy
> anything beyond what the soname mechanism does.

You can access the versioned API as well; it's just not documented. (The
ICU team does support this - we worked very closely with them when doing
all this.) We exploited the versioned API when we learned that there is no
guarantee of backwards compatibility in collations. You can't just change a
collation under a user (at least that was our opinion) since it can cause
all sorts of problems, and refreshing a collation (especially on the fly) is
a lot more work than we were prepared to take on.

We carried the ICU version numbers around on our collation and locale IDs
(such as fr_FR%icu36). The database would load multiple versions of the
ICU library so that something created with ICU 3.6 would always be
processed with ICU 3.6. This avoided the problems of trying to change the
rules on the user. (We'd always intended to provide tooling to allow the
user to move an existing object up to a newer version of ICU, but we never
got around to doing it.) In the code, this meant we were explicitly calling
the versioned API so that we could keep the calls straight. (Of course this
was abstracted in a set of our own locale functions so that the rest of the
engine was ignorant of the ICU library fun that was going on.)
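
To make the ID scheme above concrete, here is a minimal sketch in C of how
an ID like fr_FR%icu36 might be split and routed to a per-version entry
point. Everything in it is an assumption made for illustration: the
VersionedLocale struct, parse_versioned_locale(), lookup_collator() and the
stub comparison functions are hypothetical names, and the stubs simply call
strcmp() where a real system would wrap the versioned ICU symbol from the
matching library build. It is not the code we actually shipped.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of an ID such as "fr_FR%icu36". */
typedef struct {
    char locale[32];   /* e.g. "fr_FR" */
    int  icu_version;  /* e.g. 36; 0 means no version was pinned */
} VersionedLocale;

/* Split "locale%icuNN" into its parts. Returns 0 on success, -1 on bad input. */
int parse_versioned_locale(const char *id, VersionedLocale *out)
{
    assert(id != NULL && out != NULL);

    const char *sep = strchr(id, '%');
    size_t len = sep ? (size_t) (sep - id) : strlen(id);

    if (len == 0 || len >= sizeof(out->locale))
        return -1;
    memcpy(out->locale, id, len);
    out->locale[len] = '\0';
    out->icu_version = 0;

    if (sep == NULL)
        return 0;                       /* unversioned: caller picks a default */
    if (strncmp(sep + 1, "icu", 3) != 0 || !isdigit((unsigned char) sep[4]))
        return -1;

    char *end;
    long v = strtol(sep + 4, &end, 10);
    if (*end != '\0' || v <= 0 || v > 9999)
        return -1;
    out->icu_version = (int) v;
    return 0;
}

/* One entry per loaded library version. The stubs stand in for wrappers
 * around that version's renamed symbols (an assumption, not real ICU calls). */
typedef int (*compare_fn)(const char *a, const char *b);

typedef struct {
    int        icu_version;
    compare_fn compare;
} CollatorEntry;

static int compare_stub_36(const char *a, const char *b) { return strcmp(a, b); }
static int compare_stub_46(const char *a, const char *b) { return strcmp(a, b); }

static const CollatorEntry collators[] = {
    { 36, compare_stub_36 },
    { 46, compare_stub_46 },
};

/* Return the comparison routine pinned to icu_version, or NULL if that
 * version is not loaded. */
compare_fn lookup_collator(int icu_version)
{
    for (size_t i = 0; i < sizeof(collators) / sizeof(collators[0]); i++)
        if (collators[i].icu_version == icu_version)
            return collators[i].compare;
    return NULL;
}
```

The point of the table is that an object created under one version keeps
resolving to that version's routine, and a lookup for a version that is not
loaded fails loudly instead of silently falling back to a newer one.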

> We can refine the guidance. But indexes are the most important issue, I
> think, because changing the sorting rules in the background makes data
> silently disappear.

I'd say that collation is the most important issue, but collation impacts a
lot more than indexes.

Unfortunately, as part of changing companies, I had to leave my "screwy stuff
that has happened in collations" presentation behind, so I don't have
concrete examples to point to, but I can cook up illustrative examples:

- Suppose in ICU X.X, AA = Å but in ICU Y.Y, AA != Å. Further suppose there
was an RI constraint where the primary key used AA and the foreign key
used Å. If ICU were updated from X.X to Y.Y, the two values would no longer
match, so the RI constraint between the rows would break, leaving an
orphaned foreign key.

- I can't remember the specific language, but they had the collation rule
that "CH" was treated as a distinct entity between C and D. This gave the
order C < CG < CI < CZ < CH < D. Then they removed CH as special, which gave
C < CG < CH < CI < CZ < D. Suppose there was the constraint CHECK (COL
BETWEEN 'C' AND 'CH'). Originally it would allow (almost) all strings that
started with C. After the change, the constraint would reject everything
from CI through CZ, leaving many existing rows in the database that no
longer satisfy it.
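
The CH example can be played out with a toy comparator. The sketch below is
purely illustrative and assumes uppercase ASCII input; toy_collate() and
check_between_c_and_ch() are hypothetical names, and the weighting scheme is
made up to reproduce the two orderings above, not taken from ICU or CLDR.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Read one collation element from *s and advance past it.
 * With ch_is_letter set, the digraph "CH" is a single element that
 * sorts after plain 'C' and before 'D'. Returns -1 at end of string. */
static int next_weight(const char **s, bool ch_is_letter)
{
    const char *p = *s;

    if (*p == '\0')
        return -1;
    if (ch_is_letter && p[0] == 'C' && p[1] == 'H') {
        *s = p + 2;
        return 2 * 'C' + 1;     /* slots in between 'C' and 'D' */
    }
    *s = p + 1;
    return 2 * (unsigned char) *p;
}

/* strcmp-style comparison under the chosen rule set. */
int toy_collate(const char *a, const char *b, bool ch_is_letter)
{
    assert(a != NULL && b != NULL);

    for (;;) {
        int wa = next_weight(&a, ch_is_letter);
        int wb = next_weight(&b, ch_is_letter);

        if (wa != wb)
            return (wa < wb) ? -1 : 1;
        if (wa == -1)
            return 0;           /* both strings ended together */
    }
}

/* CHECK (COL BETWEEN 'C' AND 'CH') evaluated under one rule set. */
bool check_between_c_and_ch(const char *col, bool ch_is_letter)
{
    return toy_collate("C", col, ch_is_letter) <= 0 &&
           toy_collate(col, "CH", ch_is_letter) <= 0;
}
```

With ch_is_letter true, the values CG, CI and CZ all pass the check; with it
false, CI and CZ fail while CG still passes. A row inserted under the old
rules is never re-validated, which is how the invalid rows end up sitting in
the table.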

It could be argued that these are edge cases and not likely to be hit.
That's likely true for a lot of users. But for a user who hits this, their
database is going to be left in a mess.

--
Doug Doole

On Tue, Sep 6, 2016 at 8:37 AM Peter Eisentraut <
peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:

> On 8/31/16 4:24 PM, Doug Doole wrote:
> > ICU explicitly does not provide stability in their locales and
> collations. We pushed them hard to provide this, but between changes to the
> CLDR data and changes to the ICU code it just wasn’t feasible for them to
> provide version to version stability.
> >
> > What they do offer is a compile option when building ICU to version all
> their APIs. So instead of calling icu_foo() you’d call icu_foo46(). (Or
> something like this - it’s been a few years since I actually worked with
> the ICU code.) This ultimately allows you to load multiple versions of the
> ICU library into a single program and provide stability by calling the
> appropriate version of the library. (Unfortunately, the OS - at least my
> Linux box - only provides the generic version of ICU and not the version
> annotated APIs, which means a separate compile of ICU is needed.)
> >
> > The catch with this is that it means you likely want to expose the
> version information. In another note it was suggested to use something like
> fr_FR%icu. If you want to pin it to a specific version of ICU, you’ll
> likely need something like fr_FR%icu46. (There’s nothing wrong with
> supporting fr_FR%icu to give users an easy way of saying “give me the
> latest and greatest”, but you’d probably want to harden it to a specific
> ICU version internally.)
>
> There are multiple things going on.
>
> Collations in ICU are versioned. You can find out the version of the
> collation you are currently using using an API call. A collation
> version does not change during the life of a single version of ICU. But
> it might well change in the next version of ICU, as bugs are fixed and
> things are refined. There is no way in the API to call for a collation
> of a specific version, since there is only one version of a collation in
> a specific installation of ICU. So my implementation is that we store
> the version of the collation in the catalog when we create the
> collation, and if we later on find at run time that the collation is of
> a different version, we warn about it.
>
> The ICU ABI (not API) is also versioned. The way that this is done is
> that all functions are actually macros to a versioned symbol. So
> ucol_open() is actually a macro that expands to, say, ucol_open_57() in
> ICU version 57. (They also got rid of a dot in their versions a while
> ago.) It's basically hand-crafted symbol versioning. That way, you can
> link with multiple versions of ICU at the same time. However, the
> purpose of that, as I understand it, is so that plugins can have a
> different version of ICU loaded than the main process or another plugin.
> In terms of postgres using the right version of ICU, it doesn't buy
> anything beyond what the soname mechanism does.
>
> >> + if (numversion != collform->collversion)
> >> +     ereport(WARNING,
> >> +             (errmsg("ICU collator version mismatch"),
> >> +              errdetail("The database was created using version 0x%08X, the library provides version 0x%08X.",
> >> +                        (uint32) collform->collversion, (uint32) numversion),
> >> +              errhint("Rebuild affected indexes, or build PostgreSQL with the right version of ICU.")));
> >>
> >> So you still need to manage this carefully, but at least you have a
> >> chance to learn about it.
> >
> > Indexes are the obvious place where collation comes into play, and are
> relatively easy to address. But consider all the places where string
> comparisons can be done. For example, check constraints and referential
> constraints can depend on string comparisons. If the collation rules change
> because of a new version of ICU, the database can become inconsistent and
> will need a lot more work than an index rebuild.
>
> We can refine the guidance. But indexes are the most important issue, I
> think, because changing the sorting rules in the background makes data
> silently disappear.
>
> --
> Peter Eisentraut http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
