Re: ICU for global collation

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Subject: Re: ICU for global collation
Date: 2022-03-15 07:41:03
Message-ID: 20220315070727.xf4sateg3apjwnro@jrouhaud
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 14, 2022 at 01:50:50PM +0100, Peter Eisentraut wrote:
> On 05.03.22 09:38, Julien Rouhaud wrote:
> > I say it works because I did manually check, as far as I can see there isn't
> > any test that ensures it.
> >
> > I'm using this naive scenario:
> >
> > DROP DATABASE IF EXISTS dbicu;
> > CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'en_US' ICU_LOCALE 'en-u-kf-upper' template 'template0';
> > \c dbicu
> > CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
> > CREATE TABLE icu(def text, en text COLLATE "en_US", upfirst text COLLATE upperfirst);
> > INSERT INTO icu VALUES ('a', 'a', 'a'), ('b','b','b'), ('A','A','A'), ('B','B','B');
> > SELECT def AS def FROM icu ORDER BY def;
> > SELECT def AS en FROM icu ORDER BY en;
> > SELECT def AS upfirst FROM icu ORDER BY upfirst;
> > SELECT def AS upfirst_explicit FROM icu ORDER BY en COLLATE upperfirst;
> > SELECT def AS en_x_explicit FROM icu ORDER BY def COLLATE "en-x-icu";
> >
> > Maybe there should be some test along those lines included in the patch?
>
> I added something like this to a new test suite under src/test/icu/.
> (src/test/locale/ was already used for something else.)

Great, thanks!

> > > The locale object in ICU is an identifier that specifies a particular locale
> > > and has fields for language, country, and an optional code to specify further
> > > variants or subdivisions. These fields also can be represented as a string
> > > with the fields separated by an underscore.
>
> I think the Localization chapter needs to be reorganized a bit, but I'll
> leave that for a separate patch.

WFM.

> > I spent some time looking at the ICU api trying to figure out if using a
> > posix locale name (e.g. en_US) was actually compatible with an ICU locale name.
> > It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the
> > same locale, but I might be wrong. I also didn't find a way to figure out how
> > to ask ICU if the locale identifier passed is complete garbage or not. One
> > sure thing is that the system collation we import are of the form 'en-us', so
> > it seems weird to have this form in pg_collation and by default another form in
> > pg_database.
>
> Yeah it seems to be inconsistent about that. The locale ID documentation
> appears to indicate that "en_US" is the canonical form, but when you ask it
> to list all the locales it knows about it returns "en-US".

Yeah I saw that too when checking is POSIX locale names were valid, and that's
not great.
>
> > In CREATE DATABASE manual:
> >
> > + Specifies the provider to use for the default collation in this
> > + database. Possible values are:
> > + <literal>icu</literal>,<indexterm><primary>ICU</primary></indexterm>
> > + <literal>libc</literal>. <literal>libc</literal> is the default. The
> > + available choices depend on the operating system and build options.
> >
> > That's actually not true as pg_strcasecmp is used in createdb():
>
> I don't understand what the concern is here.

Ah sorry I missed the <indexterm>, I thought the possible values listed were
icu, ICU and libc. Shouldn't the ICU <indexterm> be before the icu <literal>
rather than the libc, or at least before the comma?
>
> Yeah, I think it is not the job of the initdb man page to lecture people
> about the merits of their locale choice. The same performance warning can
> also be found in the localization chapter; we don't need to repeat it
> everywhere a locale choice is mentioned.

Ah, I tried to look for another place where the same warning could be found but
missed it. I'm fine with it then!

> > While on that topic, the doc should probably mention that default ICU
> > collations can only be deterministic.
>
> Well, there is no option to do otherwise, so I'm not sure where/how to
> mention that. We usually don't document options that don't exist. ;-)

Sure, but I'm afraid that users may still be tempted to use ICU locales like
und-u-ks-level2 from the case_insensitive example in the doc and hope that it
will work accordingly. Or maybe it's just me that still sees ICU as dark magic
and want to be extra cautious.

Unrelated, but I just realized that we have PGC_INTERNAL gucs for lc_ctype and
lc_collate. Should we add one for icu_locale too?

> > Also unrelated, but how about raising a warning about possibly hiding
> > corruption if you use CREATE COLLATION ... VERSION or CREATE DATABASE ...
> > COLLATION_VERSION if !IsBinaryUpgrade()?
>
> Hmm, there is a point to that. But then we should just make that an error.
> Otherwise, we raise the warning but then there is no way to "fix" what was
> warned about. Seems confusing.

I was afraid that an error was a bit too much, but if that's acceptable I agree
that it's better.

> > in AlterCollation(), pg_collation_actual_version(), AlterDatabaseRefreshColl()
> > and pg_database_collation_actual_version():
> >
> > - datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull);
> > - Assert(!isnull);
> > - newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
> > + datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
> > + if (!isnull)
> > + newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
> > + else
> > + newversion = NULL;
> >
> > The columns are now nullable, but can you actually end up with a null locale
> > name in the expected field without manual DML on the catalog, corruption or
> > similar? I think it should be a plain error explaining the inconsistency
> > rather then silently assuming there's no version. Note that at least
> > pg_newlocale_from_collation() asserts that the specific libc/icu field it's
> > interested in isn't null.
>
> This is required because the default collations's fields are null now.

Yes I saw that, but that's a specific exception. Detecting whether it's the
DEFAULT_COLLATION_OID or not and raise an error when a null value isn't
expected seems like it could be worthwhile.

> > I'm wondering if we could now always return &default_locale and avoid having to
> > check if the function returned something in all callers, since CheckMyDatabase
> > now initialize it.
>
> Maybe that's something to look into, but that would probably require
> updating a call callers to handle the return values differently, which would
> require quite a bit of work.

Agreed, I just wanted to mention it before forgetting about it.

So apart from the few details mentioned above I'm happy with this patch!

There might still be some more adjustments needed based on Robert's comment.
At the very least it doesn't seem unreasonable to forbid a default ICU
collation with a C/POSIX lc_ctype for instance.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2022-03-15 07:46:11 Re: Can we consider "24 Hours" for "next day" in INTERVAL datatype ?
Previous Message Laurenz Albe 2022-03-15 07:40:12 Re: Can we consider "24 Hours" for "next day" in INTERVAL datatype ?