Re: Order changes in PG16 since ICU introduction

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Joe Conway <mail(at)joeconway(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Order changes in PG16 since ICU introduction
Date: 2023-06-09 15:55:40
Message-ID: addce4a61a57631e8e71a753a332c5ed23c9ad2f.camel@j-davis.com

On Fri, 2023-06-09 at 14:12 +0200, Daniel Verite wrote:
> >  I implemented a compromise where initdb will
> >  change C.UTF-8 to the built-in provider
>
> $ initdb --locale=C.UTF-8

...

> This setup is not what the user has asked for and leads to that kind
> of wrong results:
>
> $ psql -c "select upper('é')"
>  ?column?
> ----------
>  é
>
> whereas in v15 we would get the correct result 'É'.
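
To make the difference concrete, here is a minimal Python sketch (not PostgreSQL code) that contrasts ASCII-only case mapping, which is roughly what a C-style locale does, with Unicode-aware case mapping. The helper name ascii_upper is hypothetical and exists only for this illustration.

```python
def ascii_upper(text: str) -> str:
    """Uppercase only the ASCII letters a-z; leave every other character untouched.

    Hypothetical helper that mimics ASCII-only case mapping.
    """
    return "".join(chr(ord(ch) - 32) if "a" <= ch <= "z" else ch for ch in text)


E_ACUTE = "\u00e9"  # precomposed LATIN SMALL LETTER E WITH ACUTE

ascii_result = ascii_upper(E_ACUTE)   # the accented letter is left as-is
unicode_result = E_ACUTE.upper()      # Python's Unicode-aware mapping

print(repr(ascii_result), repr(unicode_result))
```

The first result keeps the lowercase letter, which matches the output quoted above; the second yields the uppercase form that v15 produced with a libc C.UTF-8 locale.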

I guess where I'm confused is: why would a user actually want their
database collation to be C.UTF-8? It's slower than C, our
implementation doesn't properly version it (as you pointed out), and
the semantics don't seem great ('Z' < 'a').
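
As an illustration of the 'Z' < 'a' point, the following Python sketch sorts a few made-up words by code point, which is the ordering a C-style collation gives. The caseless sort is only a stand-in for linguistic ordering and is not what ICU or libc actually compute.

```python
words = ["apple", "Zebra", "banana"]  # sample data invented for this sketch

# Python compares strings by code point, like a C-style collation.
codepoint_order = sorted(words)

# For UTF-8, comparing the encoded bytes gives the same order as code points.
byte_order = sorted(words, key=lambda w: w.encode("utf-8"))

# Rough stand-in for a linguistic ordering (NOT a real ICU/libc collation).
caseless_order = sorted(words, key=str.casefold)

print(codepoint_order)
print(caseless_order)
```

All uppercase ASCII letters sort before all lowercase ones in the first two orderings, so "Zebra" comes ahead of "apple".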

If the user specifies provider=libc, then of course we should honor
that and C.UTF-8 is a valid locale for libc.

But if they don't specify the provider, isn't it much more likely they
just don't care much about the locale, and would be happier with C? 

Perhaps there's some better compromise here than the one I picked, but
I see this as a fairly small problem in comparison to the big problems
that we're solving.

> In general about the evolution of the patchset, your interpretation
> of "defaulting to ICU" seems to be "avoid libc at any cost", which
> IMV is unreasonably user-hostile.

The user can easily get libc behavior by specifying
--locale-provider=libc, so I don't see how you reached this conclusion.

Let me try to understand and address the points you raised here[1] in
more detail:

It looks like you are fine with 0003 applying LOCALE to whatever
provider is chosen, but you'd like to be smarter about choosing the
provider and to choose libc in at least some cases.

That is actually very much like option #2 in the list I presented
here[2], and has the same problems. How should the following behave?

initdb --locale=C --lc-collate=fr_FR.utf8
initdb --locale=en --lc-collate=fr_FR.utf8

If we switch to libc in the first case, then --locale will be ignored
and the collation will be fr_FR.utf8. But we will leave the second case
as ICU and the collation will be "en". I'm sure we can come up with
something there, but it feels like there's more room for confusion
along this path, and the builtin provider seems cleaner.
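
The asymmetry between those two invocations can be modeled with a toy Python function. This is only a sketch of the hypothetical "switch to libc for some locales" rule discussed above, not initdb's actual logic; the name pick_provider and the set LIBC_ONLY_LOCALES are assumptions made for the illustration.

```python
from typing import Optional, Tuple

# Assumption: the locales that would trigger a switch to libc under this rule.
LIBC_ONLY_LOCALES = {"C", "POSIX"}


def pick_provider(locale: str, lc_collate: Optional[str]) -> Tuple[str, str]:
    """Return (provider, effective collation) under the hypothetical rule."""
    if locale in LIBC_ONLY_LOCALES:
        # Switching to libc means --lc-collate wins and --locale is ignored.
        return "libc", lc_collate or locale
    # Otherwise stay on ICU, where --locale determines the collation.
    return "icu", locale


first = pick_provider("C", "fr_FR.utf8")    # models --locale=C --lc-collate=fr_FR.utf8
second = pick_provider("en", "fr_FR.utf8")  # models --locale=en --lc-collate=fr_FR.utf8
print(first, second)
```

In this model the same --lc-collate value takes effect in the first call and is silently overridden in the second, which is the kind of confusion described above.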

You also suggested that we consider switching the provider to libc any
time ICU doesn't support something. I'm not sure whether you meant a
static list (C, C.UTF-8, POSIX, ...?) or some kind of dynamic test. I'm
skeptical of being too smart here, but I'd like to hear what you mean.
I'm also not clear whether you think we should abandon the built-in
provider, or still select it for C/POSIX.

Regards,
Jeff Davis

[1] https://www.postgresql.org/message-id/7de2dc15-5211-45b3-afcb-71dcaf7a08bb@manitou-mail.org

[2] https://www.postgresql.org/message-id/daa9f060aa2349ebc84444515efece49e7b32c5d.camel@j-davis.com
