Re: ICU for global collation

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Subject: Re: ICU for global collation
Date: 2022-01-07 09:03:29
Message-ID: 20220107090329.lnsy6tnodrrbmyy5@jrouhaud
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I looked a bit more in this patch and I have some additional remarks.

On Thu, Dec 30, 2021 at 01:07:21PM +0100, Peter Eisentraut wrote:
>
> So this is a different approach: If you choose ICU as the default locale for
> a database, you still need to specify lc_ctype and lc_collate settings, as
> before. Unlike in the previous patch, where the ICU collation name was
> written in datcollate, there is now a third column (daticucoll), so we can
> store all three values. This fixes the described problem. Other than that,
> once you get all the initial settings right, it basically just works: The
> places that have ICU support now will use a database-wide ICU collation if
> appropriate, the places that don't have ICU support continue to use the
> global libc locale settings.

So just to confirm a database can now have 2 different *default* collations: a
libc-based one for everything that doesn't work with ICU and a ICU-based (if
specified) for everything else, and the ICU-based is optional, so if not
provided everything works as before, using the libc based default collation.

As I mentioned I think this approach is sensible. However, should we document
what are the things that are not ICU-aware?

> I changed the datcollate, datctype, and the new daticucoll fields to type
> text (from name). That way, the daticucoll field can be set to null if it's
> not applicable. Also, the limit of 63 characters can actually be a problem
> if you want to use some combination of the options that ICU locales offer.
> And for less extreme uses, having variable-length fields will save some
> storage, since typical locale names are much shorter.

I understand the need to have daticucoll as text, however it's not clear to me
why this has to be changed for datcollate and datctype? IIUC those will only
ever contain libc-based collation and are still mandatory?
>
> For the same reasons and to keep things consistent, I also changed the
> analogous pg_collation fields like that.

The respective fields in pg_collation are now nullable, so the changes there
sounds ok.

Digging a bit more in the patch here are some things that looks problematic.

- pg_upgrade

It checks (in check_locale_and_encoding()) the compatibility for each database,
and it looks like the daticucoll field should also be verified. Other than
that I don't think there is anything else needed for the pg_upgrade part as
everything else should be handled by pg_dump (I didn't look at the changes yet
given the below problems).

- CREATE DATABASE

There's a new COLLATION_PROVIDER option, but the overall behavior seems quite
unintuitive. As far as I can see the idea is to use LOCALE for the ICU default
collation, but as it's providing a default for the other values it's quite
annoying. For instance:

=# CREATE DATABASE db COLLATION_PROVIDER icu LOCALE 'fr-x-icu' LC_COLLATE 'en_GB.UTF-8';;
ERROR: 42809: invalid locale name: "fr-x-icu"
LOCATION: createdb, dbcommands.c:397

Looking at the code it's actually complaining about LC_CTYPE. If you want a
database with an ICU default collation the lc_collate and lc_ctype should
inherit what's in the template database and not what was provided in the
LOCALE I think. You could still probably overload them in some scenario, but
without a list of what isn't ICU-aware I can't really be sure of how often one
might have to do it.

Now, if I specify everything as needed it looks like it's missing some checks
on the ICU default collation when not using template0:

=# CREATE DATABASE db COLLATION_PROVIDER icu LOCALE 'en-x-icu' LC_COLLATE 'en_GB.UTF-8' LC_CTYPE 'en_GB.UTF-8';;
CREATE DATABASE

=# SELECT datname, datcollate, datctype, daticucoll FROM pg_database ;
datname | datcollate | datctype | daticucoll
-----------+-------------+-------------+------------
postgres | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
db | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu
template1 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
template0 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
(4 rows)

Unless I'm missing something the same concerns about collation incompatibility
with objects in the source database should also apply for the ICU collation?

While at it, I'm not exactly sure of what the COLLATION_PROVIDER is supposed to
mean, as the same commands but with a libc provider is accepted and has
the exact same result:

=# CREATE DATABASE db2 COLLATION_PROVIDER libc LOCALE 'en-x-icu' LC_COLLATE 'en_GB.UTF-8' LC_CTYPE 'en_GB.UTF-8';;
CREATE DATABASE

=# SELECT datname, datcollate, datctype, daticucoll FROM pg_database ;
datname | datcollate | datctype | daticucoll
-----------+-------------+-------------+------------
postgres | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
db | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu
template1 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
template0 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
db2 | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu
(5 rows)

Shouldn't db2 have a NULL daticucoll, and if so also complain about
incompatibility for it?

- initdb

I don't think that initdb --collation-provider icu should be allowed without
--icu-locale, same for --collation-provider libc *with* --icu-locale.

When trying that, I can also see that the NULL handling for daticucoll is
broken in the BKI:

$ initdb -k --collation-provider icu
[...]
Success. You can now start the database server using:

=# select datname, datcollate, datctype, daticucoll from pg_database ;
datname | datcollate | datctype | daticucoll
-----------+-------------+-------------+------------
postgres | en_GB.UTF-8 | en_GB.UTF-8 | en_GB
template1 | en_GB.UTF-8 | en_GB.UTF-8 | en_GB
template0 | en_GB.UTF-8 | en_GB.UTF-8 | en_GB
(3 rows)

There's a fallback on my LANG/LC_* env settings, but I don't think it can ever
be correct given the different naming convention in ICU (at least the s/_/-/).

And

$ initdb -k --collation-provider libc --icu-locale test
[...]
Success. You can now start the database server using:

=# select datname, datcollate, datctype, daticucoll from pg_database ;
datname | datcollate | datctype | daticucoll
-----------+-------------+-------------+------------
postgres | en_GB.UTF-8 | en_GB.UTF-8 | _null_
template1 | en_GB.UTF-8 | en_GB.UTF-8 | _null_
template0 | en_GB.UTF-8 | en_GB.UTF-8 | _null_
(3 rows)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2022-01-07 09:05:54 Re: row filtering for logical replication
Previous Message Alexander Lakhin 2022-01-07 09:00:30 Re: Index-only scan for btree_gist turns bpchar to char