Re: Order changes in PG16 since ICU introduction

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Joe Conway <mail(at)joeconway(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Order changes in PG16 since ICU introduction
Date: 2023-06-12 09:37:58
Message-ID: beda0794-1d72-4584-8578-cf7d95fda396@manitou-mail.org
Lists: pgsql-hackers

Jeff Davis wrote:

> I guess where I'm confused is: why would a user actually want their
> database collation to be C.UTF-8? It's slower than C, our
> implementation doesn't properly version it (as you pointed out), and
> the semantics don't seem great ('Z' < 'a').

Because when LC_CTYPE=C, characters outside of US ASCII are not
categorized properly. upper/lower/regexp matching/... produce
incorrect results.
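
As a rough illustration of the point above (not PostgreSQL code), the sketch below emulates a case mapping that only knows US-ASCII and compares it with a Unicode-aware one. The helper name ascii_only_upper is hypothetical, and the assumption is that LC_CTYPE=C leaves every character outside a-z untouched.

```python
def ascii_only_upper(s: str) -> str:
    """Hypothetical helper: uppercase a-z only, leave all other characters as-is.

    This emulates a ctype that knows nothing beyond US-ASCII; it is an
    illustration, not the actual PostgreSQL or libc implementation.
    """
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


word = "\u00e9t\u00e9"  # "été", with precomposed accented letters

# ASCII-only mapping: the accented letters are not recognized as lowercase.
print(ascii_only_upper(word))

# Unicode-aware mapping: every letter is uppercased.
print(word.upper())
```

With the ASCII-only mapping, only the "t" changes case, which is the kind of incorrect result a Unicode-aware ctype avoids.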

> But if they don't specify the provider, isn't it much more likely they
> just don't care much about the locale, and would be happier with C?

Consider a pre-existing script doing initdb --locale=C.UTF-8.
Surely it does care about the locale, otherwise it would not specify
it.
Assuming that it would be better off with C is assuming that a
non-Unicode-aware locale is better than the Unicode-aware locale
they're asking for. I don't think that's reasonable.

> The user can easily get libc behavior by specifying --locale-
> provider=libc, so I don't see how you reached this conclusion.

What would be user hostile is forcing users that don't need an ICU
locale to change their invocations of initdb/createdb to avoid
regressions with v16. Most people would discover this after
it breaks their apps.

> It looks like you are fine with 0003 applying LOCALE to whatever
> provider is chosen, but you'd like to be smarter about choosing the
> provider and to choose libc in at least some cases.
>
> That is actually very much like option #2 in the list I presented
> here[2], and has the same problems. How should the following behave?
>
> initdb --locale=C --lc-collate=fr_FR.utf8
> initdb --locale=en --lc-collate=fr_FR.utf8

The same as v15.

> If we switch to libc in the first case, then --locale will be ignored
> and the collation will be fr_FR.utf8.

$ initdb --locale=C --lc-collate=fr_FR.utf8
v15 does that:

The database cluster will be initialized with this locale configuration:
  provider:    libc
  LC_COLLATE:  fr_FR.utf8
  LC_CTYPE:    C
  LC_MESSAGES: C
  LC_MONETARY: C
  LC_NUMERIC:  C
  LC_TIME:     C
The default database encoding has accordingly been set to "SQL_ASCII".

--locale is not ignored, it's overridden for LC_COLLATE only.
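
The override behavior shown in the v15 output can be sketched as follows. This is a minimal model and not initdb's code: the function name resolve_categories and its keyword-argument interface are hypothetical. It assumes only that every category defaults to the --locale value unless a category-specific option is given.

```python
# The six locale categories listed in the initdb output.
CATEGORIES = (
    "LC_COLLATE",
    "LC_CTYPE",
    "LC_MESSAGES",
    "LC_MONETARY",
    "LC_NUMERIC",
    "LC_TIME",
)


def resolve_categories(locale: str, **overrides: str) -> dict:
    """Hypothetical model: --locale sets every category, and a per-category
    option (such as --lc-collate) replaces the value for that category only."""
    unknown = set(overrides) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown locale categories: {sorted(unknown)}")
    return {cat: overrides.get(cat, locale) for cat in CATEGORIES}


# Models: initdb --locale=C --lc-collate=fr_FR.utf8
settings = resolve_categories("C", LC_COLLATE="fr_FR.utf8")
for name, value in settings.items():
    print(f"{name}: {value}")
```

In this model LC_COLLATE ends up as fr_FR.utf8 while the other five categories keep the value given with --locale.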

> But we will leave the second case as ICU and the collation will be
> "en".

Yes. To me the rule for "ICU is the default" in v16 should be: if the
--locale argument points to a locale that we know ICU does not provide,
we fall back to the v15 behavior down to every detail, otherwise we let
ICU be the provider.

> You also suggested that we consider switching the provider to libc any
> time ICU doesn't support something. I'm not sure whether you meant a
> static list (C, C.UTF-8, POSIX, ...?) or some kind of dynamic test.

C, C.*, POSIX. I'm not sure if there are other cases.
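
A minimal sketch of the proposed rule with this static list, for illustration only. The names LIBC_ONLY_PATTERNS and choose_provider are hypothetical and not part of initdb, and treating "C.*" as a case-sensitive glob is an assumption.

```python
from fnmatch import fnmatchcase

# Static list from the proposal: locales that ICU is known not to provide.
# Reading "C.*" as a case-sensitive glob is an assumption of this sketch.
LIBC_ONLY_PATTERNS = ("C", "C.*", "POSIX")


def choose_provider(locale: str) -> str:
    """Hypothetical helper: return "libc" (the v15 behavior) when --locale
    names a locale on the static list, otherwise let ICU be the provider."""
    if any(fnmatchcase(locale, pattern) for pattern in LIBC_ONLY_PATTERNS):
        return "libc"
    return "icu"


for name in ("C", "C.UTF-8", "POSIX", "en", "fr_FR.utf8"):
    print(name, "->", choose_provider(name))
```

Under these assumptions, --locale=C and --locale=C.UTF-8 select libc, while --locale=en stays with ICU, matching the two initdb cases discussed above.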

> I'm also not clear whether you think we should abandon the built-in
> provider, or still select it for C/POSIX.

I see it as going in v17, because it came after feature freeze and
is not strictly necessary in v16.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
