Re: ICU for global collation

From: Marina Polyakova <m(dot)polyakova(at)postgrespro(dot)ru>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, pryzby(at)telsasoft(dot)com, rjuju123(at)gmail(dot)com, daniel(at)manitou-mail(dot)org, AndrewBille(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: ICU for global collation
Date: 2022-10-08 18:08:18
Message-ID: 79f410460c4fc9534000785adb8bf39a@postgrespro.ru
Lists: pgsql-hackers

On 2022-10-01 15:07, Peter Eisentraut wrote:
> On 22.09.22 20:06, Marina Polyakova wrote:
>> On 2022-09-21 17:53, Peter Eisentraut wrote:
>>> Committed with that test, thanks.  I think that covers all the ICU
>>> issues you reported for PG15 for now?
>>
>> I thought about the order of the ICU checks - if it is ok to check
>> that the selected encoding is supported by ICU after printing all the
>> locale & encoding information, why not to move almost all the ICU
>> checks here?..
>
> It's possible that we can do better, but I'm not going to add things
> like that to PG 15 at this point unless it fixes a faulty behavior.

Will PG 15 always keep this order of ICU checks? Is the current
behaviour correct enough? On the other hand, there may be a better fix
for PG 16+, and not all changes can be backported...

On 2022-09-16 10:56, Peter Eisentraut wrote:
> On 15.09.22 17:41, Marina Polyakova wrote:
>> I agree with you. Here's another version of the patch. The
>> locale/encoding checks and reports in initdb have been reordered,
>> because now the encoding is set first and only then the ICU locale is
>> checked.
>
> I committed something based on the first version of your patch. This
> reordering of the messages here was a little too much surgery for me
> at this point. For instance, there are also messages in #ifdef WIN32
> code that would need to be reordered as well. I kept the overall
> structure of the code the same and just inserted the additional
> proposed checks.
>
> If you want to pursue the reordering of the checks and messages
> overall, a patch for the master branch could be considered.

I've worked on this again (see the attached patch), but I'm not sure
whether the encoding mismatch messages are clear enough without the full
locale information. For

$ initdb -D data --icu-locale en --locale-provider icu

compare the outputs:

The database cluster will be initialized with this locale configuration:
provider: icu
ICU locale: en
LC_COLLATE: de_DE.iso885915@euro
LC_CTYPE: de_DE.iso885915@euro
LC_MESSAGES: en_US.utf8
LC_MONETARY: de_DE.iso885915@euro
LC_NUMERIC: de_DE.iso885915@euro
LC_TIME: de_DE.iso885915@euro
The default database encoding has been set to "UTF8".
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (UTF8) and the encoding that
the selected locale uses (LATIN9) do not match. This would lead to
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding
explicitly, or choose a matching combination.

and

Encoding "UTF8" implied by locale will be set as the default database
encoding.
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (UTF8) and the encoding that
the selected locale uses (LATIN9) do not match. This would lead to
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding
explicitly, or choose a matching combination.

The same without ICU, e.g. for

$ initdb -D data

the output with locale information:

The database cluster will be initialized with this locale configuration:
provider: libc
LC_COLLATE: en_US.utf8
LC_CTYPE: de_DE.iso885915@euro
LC_MESSAGES: en_US.utf8
LC_MONETARY: de_DE.iso885915@euro
LC_NUMERIC: de_DE.iso885915@euro
LC_TIME: de_DE.iso885915@euro
The default database encoding has accordingly been set to "LATIN9".
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (LATIN9) and the encoding that
the selected locale uses (UTF8) do not match. This would lead to
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding
explicitly, or choose a matching combination.

and the "shorter" version:

Encoding "LATIN9" implied by locale will be set as the default database
encoding.
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (LATIN9) and the encoding that
the selected locale uses (UTF8) do not match. This would lead to
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding
explicitly, or choose a matching combination.
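
To make it easier to see where the two encodings in the libc example
come from, here is a minimal sketch in Python. It is not the initdb
code: the mapping table and the names encoding_of_locale and
check_encoding are hypothetical and cover only the two codesets that
appear above. It assumes the default encoding is derived from LC_CTYPE
and then compared against each locale name in turn.

```python
# Illustrative only: a tiny, hypothetical subset of codeset -> encoding names.
CODESET_TO_ENCODING = {
    "utf8": "UTF8",
    "utf-8": "UTF8",
    "iso885915": "LATIN9",
    "iso-8859-15": "LATIN9",
}


def encoding_of_locale(locale_name):
    """Return the encoding implied by a name like 'en_US.utf8', else None."""
    if "." not in locale_name:
        return None  # e.g. "C" or "en_US": no codeset part to inspect
    # Take the part after the dot and drop any "@modifier" suffix.
    codeset = locale_name.split(".", 1)[1].split("@", 1)[0].lower()
    return CODESET_TO_ENCODING.get(codeset)


def check_encoding(selected, locale_name):
    """Return a mismatch description, or None if the pair is acceptable."""
    implied = encoding_of_locale(locale_name)
    if implied is not None and implied != selected:
        return (f"encoding mismatch: selected {selected}, "
                f"but locale {locale_name} uses {implied}")
    return None


# The libc example above: the encoding is implied by LC_CTYPE ...
lc_ctype = "de_DE.iso885915@euro"
lc_collate = "en_US.utf8"
encoding = encoding_of_locale(lc_ctype)  # "LATIN9"

# ... and LC_COLLATE is the locale that then fails the check.
for name in (lc_ctype, lc_collate):
    problem = check_encoding(encoding, name)
    if problem:
        print(problem)
```

In this sketch the "selected" encoding in the error is the one implied
by LC_CTYPE, not one given with -E, which is the part that is hard to
see once the locale settings are no longer printed before the error.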

BTW, what did you mean by "there are also messages in #ifdef WIN32
code that would need to be reordered as well"?

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment: v2-diff_icu_options_check_order.patch (text/x-diff, 13.6 KB)
