Re: Order changes in PG16 since ICU introduction

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Sandro Santilli <strk(at)kbt(dot)io>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Regina Obe <lr(at)pcorp(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Order changes in PG16 since ICU introduction
Date: 2023-05-18 22:46:52
Message-ID: 5f9bf4a0b040428c5db2dc1f23cc3ad96acb5672.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2023-05-18 at 20:11 +0200, Matthias van de Meent wrote:
> As I complain about in [0], since 5cd1a5af --no-locale has been
> broken
> / bahiving outside it's description: Instead of being equivalent to
> `--locale=C` it now also overrides `--locale-provider=libc`,
> resulting
> in undocumented behaviour.

I agree that 5cd1a5af is incomplete.

Posting updated patches. Feedback on the approaches below would be
appreciated.

For context, in version 15:

$ initdb -D data --locale-provider=icu --icu-locale=en
=> create database clocale template template0 locale='C';
=> select datname, datlocprovider, daticulocale
from pg_database where datname='clocale';
datname | datlocprovider | daticulocale
---------+----------------+--------------
clocale | i | en
(1 row)

That behavior is confusing, and when I made ICU the default provider in
v16, the confusion was extended into more cases.

If we leave the CREATE DATABASE (and createdb and initdb) syntax in
place, such that LOCALE (and --locale) do not apply to ICU at all, then
I don't see a path to a good ICU user experience.

Therefore I conclude that we need LOCALE (and --locale) to apply to ICU
somehow. (The LOCALE option already applies to ICU during CREATE
COLLATION, just not CREATE DATABASE or initdb.)

Patch 0003 does this. It's fairly straightforward and I believe we need
this patch.

But to actually fix your complaint we also need --no-locale to be
equivalent to --locale=C and for those options to both use memcmp()
semantics. There are several approaches to accomplish this, and I think
this is the part where I most need some feedback. There are only so
many approaches, and each one has some potential downsides, but I
believe we need to select one:

(1) Give up and leave the existing CREATE DATABASE (and createdb, and
initdb) semantics in place, along with the confusing behavior in v15.

This is a last resort, in my opinion. It gives us no path toward a good
user experience with ICU, and leaves us with all of the problems of the
OS as a collation provider.

(2) Automatically change the provider to libc when locale=C.

Almost works, but it's not clear how we handle the case "provider=icu
lc_collate='fr_FR.utf8' locale=C".

If we change it to "provider=libc lc_collate=C", we've overridden the
specified lc_collate. If we ignore the locale=C, that would be
surprising to users. If we throw an error, that would be a backwards
compatibility issue.

One possible solution would be to change the catalog representation to
allow setting the default collation locale separately from datcollate
even for the libc provider. For instance, rename daticulocale to
datdeflocale, and store the default collation locale there for both
libc and ICU. Then, "provider=icu lc_collate='fr_FR.utf8' locale=C"
could be changed into "provider=libc lc_collate='fr_FR.utf8'
deflocale=C". It may be confusing that datcollate is a different
concept from datdeflocale; but then again they are different concepts
and it's confusing that they are currently combined into one.

(3) Support iculocale=C in the ICU provider using the memcmp() path.

In other words, if provider=icu and iculocale=C, lc_collate_is_c() and
lc_ctpye_is_c() would both return true.

There's a potential problem for users who've misused ICU in the past
(15 or earlier) by using provider=icu and iculocale=C. ICU would accept
such a locale name, but not recognize it and fall back to the root
locale, so it never worked as the user intended it. But if we redefine
C to be memcmp(), then such users will have broken indexes if they
upgrade.

We could add a check at pg_upgrade time for iculocale=C in versions 15
and earlier, and cause the check (and therefore the upgrade) to fail.
That may be reasonable considering that it never really worked in the
past, and perhaps very few users actually ever created such a
collation. But if some user runs into that problem, we'd have to resort
to a hack like telling them to "update pg_collation set iculocale='und'
where iculocale='C'" and then try the upgrade again, which is not a
great answer (as far as I can tell it would be a correct answer and
should not break their indexes, but it feels pretty dangerous).

There may be some other resolutions to this problem, such as catalog
hacks that allow for different representations of iculocale=C pre-16
and post-16. That doesn't sound great though, and we'd have to figure
out what to do with pg_dump.

(4) Create a new "none" provider (which has no locale and always memcmp
semantics), and automatically change the provider to "none" if
provider=icu and iculocale=C.

This solves the problem case in #2 and the potential upgrade problem in
#3. It also makes the documentation a bit more natural, in my opinion,
even if we retain the special case for provider=libc collate=C.

#4 is the approach I chose (patches 0001 and 0002), but I'd like to
hear what others think.

For historical reasons, users may assume that LC_COLLATE controls the
default collation order because that's true in libc. And if their
provider is ICU, they may be surprised that it doesn't. I believe we
could extend each of the above approaches to use LC_COLLATE as the
default for ICU_LOCALE if the former is specified and the latter is
not, and that may make things smoother.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachment Content-Type Size
v6-0001-Introduce-collation-provider-none.patch text/x-patch 35.2 KB
v6-0002-ICU-for-locale-C-automatically-use-none-provider-.patch text/x-patch 13.1 KB
v6-0003-Make-LOCALE-apply-to-ICU_LOCALE-for-CREATE-DATABA.patch text/x-patch 15.6 KB
v6-0004-Add-default_collation_provider-GUC.patch text/x-patch 9.3 KB
v6-0005-ICU-fix-up-old-libc-style-locale-strings.patch text/x-patch 9.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2023-05-18 22:51:59 Re: PG 16 draft release notes ready
Previous Message Michael Paquier 2023-05-18 22:36:32 Re: Introduce WAIT_EVENT_EXTENSION and WAIT_EVENT_BUFFER_PIN