Re: ICU for global collation

From: Marina Polyakova <m(dot)polyakova(at)postgrespro(dot)ru>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, pryzby(at)telsasoft(dot)com, rjuju123(at)gmail(dot)com, daniel(at)manitou-mail(dot)org, AndrewBille(at)gmail(dot)com, michael(at)paquier(dot)xyz, peter(dot)eisentraut(at)enterprisedb(dot)com
Subject: Re: ICU for global collation
Date: 2022-09-16 07:31:42
Message-ID: 1989d430b926be3c08735f97fffc6294@postgrespro.ru
Lists: pgsql-hackers

On 2022-09-16 07:55, Kyotaro Horiguchi wrote:
> At Thu, 15 Sep 2022 18:41:31 +0300, Marina Polyakova
> <m(dot)polyakova(at)postgrespro(dot)ru> wrote in
>> P.S. While working on the patch, I discovered that UTF8 encoding is
>> always used for the ICU provider in initdb unless it is explicitly
>> specified by the user:
>>
>>     if (!encoding && locale_provider == COLLPROVIDER_ICU)
>>         encodingid = PG_UTF8;
>>
>> IMO this creates additional errors for locales with other encodings:
>>
>> $ initdb --locale de_DE.iso885915@euro --locale-provider icu
>> --icu-locale de-DE
>> ...
>> initdb: error: encoding mismatch
>> initdb: detail: The encoding you selected (UTF8) and the encoding that
>> the selected locale uses (LATIN9) do not match. This would lead to
>> misbehavior in various character string processing functions.
>> initdb: hint: Rerun initdb and either do not specify an encoding
>> explicitly, or choose a matching combination.
>>
>> And ICU supports many encodings, see the contents of pg_enc2icu_tbl in
>> encnames.c...
>
> It seems to me to be the best default, one that fits almost all cases
> using ICU locales.
>
> So, we need to specify encoding explicitly in that case.
>
> $ initdb --encoding iso-8859-15 --locale de_DE.iso885915@euro
> --locale-provider icu --icu-locale de-DE
>
> However, I think it is hardly understandable from the documentation.
>
> (I checked this using euc-jp [1] so it might be wrong..)
>
> [1] initdb --encoding euc-jp --locale ja_JP.eucjp --locale-provider
> icu --icu-locale ja-x-icu
>
> regards.

Thank you!

IMO it is hard to understand from the program output either: it looks
as if I had chosen the UTF8 encoding manually. Maybe initdb should first
report the encoding it selected?

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 6aeec8d426c52414b827686781c245291f27ed1f..348bbbeba0f5bc7ff601912bf883510d580b814c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2310,7 +2310,11 @@ setup_locale_encoding(void)
     }
 
     if (!encoding && locale_provider == COLLPROVIDER_ICU)
+    {
         encodingid = PG_UTF8;
+        printf(_("The default database encoding has been set to \"%s\" for a better experience with the ICU provider.\n"),
+               pg_encoding_to_char(encodingid));
+    }
     else if (!encoding)
     {
         int ctype_enc;
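
To illustrate why the mismatch error appears, here is a minimal,
self-contained C sketch of the decision. It is not the actual initdb
code: choose_encoding(), encoding_mismatch() and provider_t are
hypothetical names, encodings are plain strings rather than initdb's
encoding IDs, and the real checks are more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the locale provider setting. */
typedef enum { PROVIDER_LIBC, PROVIDER_ICU } provider_t;

/*
 * Pick the database encoding (simplified assumption of the logic):
 * an explicit --encoding wins; otherwise the ICU provider defaults to
 * UTF8 and the libc provider follows the encoding of the locale.
 * *defaulted_for_icu is set when the ICU default was applied, which is
 * the case the proposed message would announce.
 */
const char *
choose_encoding(const char *requested, provider_t provider,
                const char *locale_encoding, bool *defaulted_for_icu)
{
    assert(locale_encoding != NULL);
    assert(defaulted_for_icu != NULL);

    *defaulted_for_icu = false;
    if (requested != NULL)
        return requested;
    if (provider == PROVIDER_ICU)
    {
        *defaulted_for_icu = true;
        return "UTF8";
    }
    return locale_encoding;
}

/* True when the chosen encoding differs from the locale's encoding. */
bool
encoding_mismatch(const char *chosen, const char *locale_encoding)
{
    return strcmp(chosen, locale_encoding) != 0;
}
```

With no requested encoding, the ICU provider and a LATIN9 locale,
choose_encoding() returns "UTF8" and encoding_mismatch() reports a
conflict, which corresponds to the error quoted above. Passing the
encoding explicitly avoids the conflict.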

ISTM that such choices (e.g. UTF8 for Windows in some cases) are
described in the documentation [1] as

By default, initdb uses the locale provider libc, takes the locale
settings from the environment, and determines the encoding from the
locale settings. This is almost always sufficient, unless there are
special requirements.

[1] https://www.postgresql.org/docs/devel/app-initdb.html

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
