Re: Built-in CTYPE provider

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2023-12-22 11:26:43
Message-ID: f7dd3ff4-5f1c-4a0f-8a3c-0a521d35b001@manitou-mail.org
Lists: pgsql-hackers

Robert Haas wrote:

> For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8,
> a change to C.utf8 would be a much bigger problem, I would
> think. Their alphabet isn't in code point order, and so things would
> be alphabetized wrongly.

> That might be OK if they don't care about ordering for any purpose
> other than equality lookups, but otherwise it's going to force them
> to change the default, where today they don't have to do that.

Sure, in whatever collation setup we expose, we need to keep
it possible and even easy to sort properly with linguistic rules.

But some reasons to use $LANG as the default locale/collation
are no longer as good as they used to be, I think.

Starting with v10 and its ICU support, we have many pre-created ICU
locales with fixed names, and starting with v16, we can simply write
"ORDER BY textfield COLLATE unicode", which is good enough in most
cases. So the configuration "bytewise sort by default" / "linguistic
sort on-demand" has become more realistic.
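
To make the difference between the two orderings concrete, here is a small
Python sketch (not PostgreSQL code). It assumes that plain string comparison
is an acceptable stand-in for a bytewise/code point sort, and that stripping
accents is an acceptable stand-in for a linguistic sort. The second assumption
is a crude approximation: real collations such as the ICU ones apply far more
elaborate rules. The function names and the word list are hypothetical.

```python
import unicodedata


def codepoint_sorted(words):
    # Python compares str values code point by code point, which gives the
    # same order as comparing their UTF-8 encodings byte by byte.
    return sorted(words)


def folded_key(word):
    # ASSUMPTION: a crude approximation of a linguistic sort key.
    # Decompose, drop combining marks, casefold; the original word is kept
    # as a tie-breaker so the ordering is deterministic.
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), word)


def linguistic_sorted(words):
    # "Linguistic sort on demand": only callers that need it pay for it.
    return sorted(words, key=folded_key)


# \u00e9 is the precomposed "e with acute accent".
words = ["zoo", "\u00e9cole", "eau", "\u00e9t\u00e9"]

print(codepoint_sorted(words))   # accented words end up after "zoo"
print(linguistic_sorted(words))  # accented words sort next to "eau"
```

The first result is the kind of order a bytewise default produces; the second
is what an application would request explicitly, only where it matters.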

By contrast in the pre-v10 days with only libc collations, an
application could have no idea which collations were going to be
available on the server, and how they were named precisely, as this
varies across OSes and across installs even with the same OS.
On Windows, I think that before v16 initdb did not create any libc
collation beyond C/POSIX and the default language/region of the OS.

In that libc context, if a db wants the C locale by default for
performance and truly immutable indexes, but the client app needs to
occasionally do in-db linguistic sorts, the app needs to figure out
which collation name will work for that. This is hard if you don't
target a specific installation that guarantees that a given
collation is going to be installed.
Whereas if the linguistic locale is the default, the app never needs
to know its name or anything about it. So that is how it tends to be
done: linguistic by default. But that leaves databases with many
indexes sorted linguistically instead of bytewise for fields
that semantically never need any linguistic sort.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
