Re: ICU for global collation

From: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Subject: Re: ICU for global collation
Date: 2022-03-17 10:28:39
Message-ID: 7c9270a4-db00-004b-df1a-fede4697a779@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 15.03.22 08:41, Julien Rouhaud wrote:
>>>> The locale object in ICU is an identifier that specifies a particular locale
>>>> and has fields for language, country, and an optional code to specify further
>>>> variants or subdivisions. These fields also can be represented as a string
>>>> with the fields separated by an underscore.
>>
>> I think the Localization chapter needs to be reorganized a bit, but I'll
>> leave that for a separate patch.
>
> WFM.

I ended up writing a bit of content for that chapter.

>>> While on that topic, the doc should probably mention that default ICU
>>> collations can only be deterministic.
>>
>> Well, there is no option to do otherwise, so I'm not sure where/how to
>> mention that. We usually don't document options that don't exist. ;-)
>
> Sure, but I'm afraid that users may still be tempted to use ICU locales like
> und-u-ks-level2 from the case_insensitive example in the doc and hope that it
> will work accordingly. Or maybe it's just me that still sees ICU as dark magic
> and want to be extra cautious.

I have added this to the CREATE DATABASE ref page.

> Unrelated, but I just realized that we have PGC_INTERNAL gucs for lc_ctype and
> lc_collate. Should we add one for icu_locale too?

I'm not sure. I think the existing ones are more for backward
compatibility with the time before it was settable per database.

>>> in AlterCollation(), pg_collation_actual_version(), AlterDatabaseRefreshColl()
>>> and pg_database_collation_actual_version():
>>>
>>> - datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull);
>>> - Assert(!isnull);
>>> - newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
>>> + datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
>>> + if (!isnull)
>>> + newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
>>> + else
>>> + newversion = NULL;
>>>
>>> The columns are now nullable, but can you actually end up with a null locale
>>> name in the expected field without manual DML on the catalog, corruption or
>>> similar? I think it should be a plain error explaining the inconsistency
>>> rather then silently assuming there's no version. Note that at least
>>> pg_newlocale_from_collation() asserts that the specific libc/icu field it's
>>> interested in isn't null.
>>
>> This is required because the default collations's fields are null now.
>
> Yes I saw that, but that's a specific exception. Detecting whether it's the
> DEFAULT_COLLATION_OID or not and raise an error when a null value isn't
> expected seems like it could be worthwhile.

I have fixed that as you suggest.

> So apart from the few details mentioned above I'm happy with this patch!

committed!

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2022-03-17 10:55:40 Re: Skipping logical replication transactions on subscriber side
Previous Message Peter Eisentraut 2022-03-17 10:22:32 pgsql: Add option to use ICU as global locale provider