Re: Order changes in PG16 since ICU introduction

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Sandro Santilli <strk(at)kbt(dot)io>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Regina Obe <lr(at)pcorp(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Order changes in PG16 since ICU introduction
Date: 2023-05-24 15:39:07
Message-ID: 660e18263897e046e9104cc60c1cea14dc9e7551.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 2023-05-22 at 22:09 +0200, Daniel Verite wrote:
> While I agree that the LOCALE option in CREATE DATABASE is
> counter-intuitive,

I think it's more than that. As Andreww Gierth pointed out:

$ initdb --locale=fr_FR
...
ICU locale: en-US
...

Is more than just counter-intuitive. I don't think we can ship 16 that
way.

> I find it questionable that blending ICU
> and libc locales into it helps that much with the user experience.

Thank you for going through some examples here. I agree that it's not
perfect, but we need some path to a reasonable ICU user experience, and
I think we'll have to accept some rough edges to avoid the worst cases,
like above.

> initdb:
>
>   Using default ICU locale "fr".
>   Using language tag "fr" for ICU locale "fr".
>

...

> #1
>
> postgres=# create database test1 locale='fr_FR.UTF-8';
> NOTICE:  using standard form "fr-FR" for ICU locale "fr_FR.UTF-8"
> ERROR:  new ICU locale (fr-FR) is incompatible with the ICU locale of

I don't see a problem here. If you specify LOCALE to CREATE DATABASE,
you should either be using "TEMPLATE template0", or you should be
expecting an error if the LOCALE doesn't match exactly.

What would you like to see happen here?

> #2
>
> postgres=# create database test2 locale='C.UTF-8'
> template='template0';
> NOTICE:  using standard form "en-US-u-va-posix" for ICU locale
> "C.UTF-8"
> CREATE DATABASE
>
>
> en-US-u-va-posix does not sort like C.UTF-8 in glibc 2.35, so
> this interpretation is arguably not what a user would expect.

As you pointed out, this is not settled in libc either:

https://www.postgresql.org/message-id/8a3dc06f-9b9d-4ed7-9a12-2070d8b0165f%40manitou-mail.org

We really can't expect a particular order for a particular locale name,
unless we handle it specially like "C" or "POSIX". If we pass it to the
provider, we have to trust the provider to match our conceptual
expectations for that locale (and ideally version it properly).

> I would expect the ICU warning or error (icu_validation_level) to
> kick
> in instead of that transliteration.

Earlier versions of ICU (<= 63) do this transformation automatically,
and I don't see a reason to throw an error if ICU considers it valid.
The language tag en-US-u-va-posix will be stored in the catalog, and
that will be considered valid in later versions of ICU.

Later versions of ICU (>= 64) consider locales with a language name of
"C" to be obsolete and no longer valid. I added code to do the
transformation without error in these later versions, but I think we
have agreement to remove it.

If a user specifies the locale as "C.UTF-8", we can either pass it to
ICU and see whether that version accepts it or not (and if not, throw a
warning/error); or if we decide that "C.UTF-8" really means "C", we can
handle it in the memcmp() path like C and never send it to ICU.

> #3
>
> $ grep french /etc/locale.alias
> french          fr_FR.ISO-8859-1
>
> postgres=# create database test3 locale='french' template='template0'
> encoding='LATIN1';
> WARNING:  ICU locale "french" has unknown language "french"
> HINT:  To disable ICU locale validation, set parameter
> icu_validation_level
> to DISABLED.
> CREATE DATABASE
>
>
> In practice we're probably getting the "und" ICU locale whereas "fr"
> would
> be appropriate.

This is a good point and illustrates that ICU is not a drop-in
replacement for libc in all cases.

I don't see a solution here that doesn't involve some rough edges,
though. "Locale" is a generic term, and if we continue to insist that
it really means a libc locale, then ICU will never be on an equal
footing with libc, let alone the preferred provider.

Regards,
Jeff Davis

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-05-24 15:57:24 Re: memory leak in trigger handling (since PG12)
Previous Message Tom Lane 2023-05-24 15:37:10 Re: memory leak in trigger handling (since PG12)