Re: ICU for global collation

From: Marina Polyakova <m(dot)polyakova(at)postgrespro(dot)ru>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Daniel Verite <daniel(at)manitou-mail(dot)org>, pryzby(at)telsasoft(dot)com
Subject: Re: ICU for global collation
Date: 2022-08-20 09:41:51
Message-ID: 78ac48c77a365908d11d19519cb6ee56@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2022-08-17 19:53, Julien Rouhaud wrote:
> Good catch. There's unfortunately not a lot of regression tests for
> ICU-initialized clusters. I'm wondering if the build-farm client could
> be
> taught about the locale provider rather than assuming libc. Clearly
> that
> wouldn't have caught this issue, but it should still increase the
> coverage a
> bit (I'm thinking of the recent problem with the abbreviated keys).

Looking at installchecks with different locales (e.g. see [1] with
sv_SE.UTF-8) - why not?..

>> diff --git a/src/backend/commands/dbcommands.c
>> b/src/backend/commands/dbcommands.c
>> index
>> b31a30550b025d48ba3cc250dc4c15f41f9a80be..17a2942341e528c01182fb6d4878580f2706bec9
>> 100644
>> --- a/src/backend/commands/dbcommands.c
>> +++ b/src/backend/commands/dbcommands.c
>> @@ -1048,6 +1048,8 @@ createdb(ParseState *pstate, const CreatedbStmt
>> *stmt)
>> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> errmsg("ICU locale must be specified")));
>> }
>> + else
>> + dbiculocale = NULL;
>>
>> if (dblocprovider == COLLPROVIDER_ICU)
>> check_icu_locale(dbiculocale);
>
> I think it would be better to do that in the various variable
> initialization.
> Maybe switch the dblocprovider and dbiculocale initialization, and only
> initialize dbiculocale if dblocprovider is COLLPROVIDER_ICU.

diff --git a/src/backend/commands/dbcommands.c
b/src/backend/commands/dbcommands.c
index
b31a30550b025d48ba3cc250dc4c15f41f9a80be..883f381f3453142790f728a3725586cebe2e2f20
100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1012,10 +1012,10 @@ createdb(ParseState *pstate, const CreatedbStmt
*stmt)
dbcollate = src_collate;
if (dbctype == NULL)
dbctype = src_ctype;
- if (dbiculocale == NULL)
- dbiculocale = src_iculocale;
if (dblocprovider == '\0')
dblocprovider = src_locprovider;
+ if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
+ dbiculocale = src_iculocale;

/* Some encodings are client only */
if (!PG_VALID_BE_ENCODING(encoding))

Then it seemed to me that it was easier to first get all the parameters
from the template database as usual and then process them as needed. But
with your suggestion the failed assertion will check the code above more
accurately...

> This discrepancy between createdb and CREATE DATABASE looks like an
> oversight,
> as createdb indeed interprets --locale as:
>
> if (locale)
> {
> if (lc_ctype)
> pg_fatal("only one of --locale and --lc-ctype can be specified");
> if (lc_collate)
> pg_fatal("only one of --locale and --lc-collate can be specified");
> lc_ctype = locale;
> lc_collate = locale;
> }
>
> AFAIK the fallback in the CREATE DATABASE case is expected as POSIX
> locale
> names should be accepted by icu, so this should work for createdb too.

Oh, great, thanks!

> > > I spent some time looking at the ICU api trying to figure out if using a
> > > posix locale name (e.g. en_US) was actually compatible with an ICU locale name.
> > > It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the
> > > same locale, but I might be wrong. I also didn't find a way to figure out how
> > > to ask ICU if the locale identifier passed is complete garbage or not. One
> > > sure thing is that the system collation we import are of the form 'en-us', so
> > > it seems weird to have this form in pg_collation and by default another form in
> > > pg_database.
> >
> > Yeah it seems to be inconsistent about that. The locale ID documentation
> > appears to indicate that "en_US" is the canonical form, but when you ask it
> > to list all the locales it knows about it returns "en-US".
>
> Yeah I saw that too when checking is POSIX locale names were valid, and
> that's
> not great.

I'm sorry but IIUC pg_import_system_collations uses uloc_getAvailable to
get the locale ID and then specifically calls uloc_toLanguageTag?..

> I don't think that initdb --collation-provider icu should be allowed
> without
> --icu-locale, same for --collation-provider libc *with* --icu-locale.

> > initdb has some specific processing to transform the default libc locale to
> > something more appropriate, but as far as I can see creatdb / CREATE DATABASE
> > aren't doing that. It seems inconsistent, and IMHO another reason why
> > defaulting to the libc locale looks like a bad idea.
>
> This has all been removed. The separate ICU locale option should now
> be
> required everywhere (initdb, createdb, CREATE DATABASE).

If it's a feature and not a bug in CREATE DATABASE, why should not it
work in initdb too? Here we define locale/lc_collate/lc_ctype for the
first 3 databases in the cluster in much the same way...

P.S. FYI there seems to be a bug for very old ICU versions: in master
(92fce4e1eda9b24d73f583fbe9b58f4e03f097a4):

$ initdb -D data &&
pg_ctl -D data -l logfile start &&
psql -c "CREATE DATABASE mydb LOCALE \"C.UTF-8\" LOCALE_PROVIDER icu
TEMPLATE template0" postgres &&
psql -c "SELECT 1" mydb

WARNING: database "mydb" has a collation version mismatch
DETAIL: The database was created using collation version 49.192.0.42,
but the operating system provides version 49.192.5.42.
HINT: Rebuild all objects in this database that use the default
collation and run ALTER DATABASE mydb REFRESH COLLATION VERSION, or
build PostgreSQL with the right library version.

See the additional output (diff_log_icu_collator_locale.patch) in the
logfile:

2022-08-20 11:38:30.162 MSK [136546] LOG: check_icu_locale
uloc_getDefault() en_US
2022-08-20 11:38:30.162 MSK [136546] STATEMENT: CREATE DATABASE mydb
LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0
2022-08-20 11:38:30.163 MSK [136546] LOG: check_icu_locale icu_locale
C.UTF-8 valid_locale en_US version 49.192.0.42
2022-08-20 11:38:30.163 MSK [136546] STATEMENT: CREATE DATABASE mydb
LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0
2022-08-20 11:38:30.163 MSK [136546] LOG: get_collation_actual_version
uloc_getDefault() en_US
2022-08-20 11:38:30.163 MSK [136546] STATEMENT: CREATE DATABASE mydb
LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0
2022-08-20 11:38:30.163 MSK [136546] LOG: get_collation_actual_version
icu_locale C.UTF-8 valid_locale en_US version 49.192.0.42
2022-08-20 11:38:30.163 MSK [136546] STATEMENT: CREATE DATABASE mydb
LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0
2022-08-20 11:38:30.224 MSK [136548] LOG: make_icu_collator
uloc_getDefault() c
2022-08-20 11:38:30.225 MSK [136548] LOG: make_icu_collator icu_locale
C.UTF-8 valid_locale root version 49.192.5.42
2022-08-20 11:38:30.225 MSK [136548] LOG: get_collation_actual_version
uloc_getDefault() c
2022-08-20 11:38:30.225 MSK [136548] LOG: get_collation_actual_version
icu_locale C.UTF-8 valid_locale root version 49.192.5.42
2022-08-20 11:38:30.225 MSK [136548] WARNING: database "mydb" has a
collation version mismatch
2022-08-20 11:38:30.225 MSK [136548] DETAIL: The database was created
using collation version 49.192.0.42, but the operating system provides
version 49.192.5.42.
2022-08-20 11:38:30.225 MSK [136548] HINT: Rebuild all objects in this
database that use the default collation and run ALTER DATABASE mydb
REFRESH COLLATION VERSION, or build PostgreSQL with the right library
version.

[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&dt=2022-08-18%2006%3A25%3A17

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
diff_log_icu_collator_locale.patch text/x-diff 4.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhihong Yu 2022-08-20 09:52:29 Re: including pid's for `There are XX other sessions using the database`
Previous Message David Rowley 2022-08-20 09:17:41 Re: shadow variables - pg15 edition