Re: pg_collation.collversion for C.UTF-8

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_collation.collversion for C.UTF-8
Date: 2023-06-19 18:47:56
Message-ID: eb571cba776b07a568fb4618d87356aeb461d1a7.camel@j-davis.com
Lists: pgsql-hackers

On Sat, 2023-06-17 at 17:54 +1200, Thomas Munro wrote:
>
> > Would it be correct to interpret LC_COLLATE=C.UTF-8 as
> > LC_COLLATE=C,
> > but leave LC_CTYPE=C.UTF-8 as-is?
>
> Yes.  The basic idea, at least for these two OSes, is that every
> category behaves as if set to C, except LC_CTYPE.

If that's true, and we version C.UTF-8, then users could still get the
behavior they want (a stable collation order and the optimized code
path) by setting LC_COLLATE=C and LC_CTYPE=C.UTF-8.
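
To make the "optimized code path" concrete: with LC_COLLATE=C, a
comparison does not need a locale-aware call such as strcoll() and can
reduce to a plain byte comparison. A minimal sketch under that
assumption (c_collation_cmp is a hypothetical name for illustration,
not a PostgreSQL function):

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not PostgreSQL code: compare two byte strings the
 * way a "C" collation does, by raw byte value, without calling into the
 * locale-aware strcoll().
 */
static int
c_collation_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t      minlen = alen < blen ? alen : blen;
    int         r = memcmp(a, b, minlen);

    if (r != 0)
        return r;

    /* Common prefix is equal: the shorter string sorts first. */
    return (alen > blen) - (alen < blen);
}
```

Since byte-wise comparison of UTF-8 follows code point order, the
result depends only on the stored bytes and not on the installed libc.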

The only caveat is to be careful with things that depend on ctype in
indexes and constraints. While still a problem, it's a smaller problem
than unversioned collation. We should think a little more about solving
it, because I think there's a strong case to be made that a default
collation of C and a database ctype of something else is a good
combination (it makes less sense for a case-insensitive collation, but
those aren't allowed as a default collation).

In any case, we're better off following the rule "version anything that
goes to any external provider, period". And by "version", I really mean
a best effort, because we don't always have great information, but I
think it's better to record what we do have than not. We have just seen
too many examples of weird behavior. On top of that, it's simply
inconsistent to assume that C=C.UTF-8 for collation version, but not
for the collation implementation.

Users might get frustrated that the collation for C.UTF-8 is versioned,
of course. But I don't think it will affect anyone for quite some time,
because existing users will have datcollversion=NULL; so they won't
get the warnings until they refresh the versions (or create new
collations/databases), and then after that upgrade libc. Right? So they
should have time to adjust to use LC_COLLATE=C if that's what they
want.
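
The reasoning above can be sketched as a small check. This is only an
illustration of the logic as described here, not the server's actual
code; collversion_mismatch and the version strings in the usage note
are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: decide whether a version-mismatch warning is due.
 * "recorded" stands in for the stored datcollversion/collversion and
 * "actual" for what the provider reports now; NULL means "no version".
 */
static bool
collversion_mismatch(const char *recorded, const char *actual)
{
    /* Nothing was recorded, so there is nothing to compare against. */
    if (recorded == NULL)
        return false;

    /*
     * Assumption for this sketch: a recorded version with no current
     * version to compare against is also worth reporting.
     */
    if (actual == NULL)
        return true;

    return strcmp(recorded, actual) != 0;
}
```

With a NULL recorded version the check stays quiet whatever libc
reports; only after a refresh stores, say, "2.36" would a later libc
reporting "2.37" trigger the warning.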

An alternative would be to define lc_collate_is_c("C.UTF-8") == true
while leaving lc_ctype_is_c("C.UTF-8") == false and
get_collation_actual_version("C.UTF-8") == NULL. In that case we would
not be passing it to an external provider, so we don't have to version
it. But that might be a little too magical and I'm not inclined to do
that.
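
For illustration only, that alternative might look roughly like the
sketch below. The sketch_* functions are hypothetical stand-ins that
take a locale name, mirroring the shorthand used above; they are not
the real signatures, and the "provider-version" string is a placeholder
for whatever an external provider would report:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: treat C.UTF-8 as "C" for collation purposes only. */
static bool
sketch_lc_collate_is_c(const char *locale)
{
    return strcmp(locale, "C") == 0 ||
           strcmp(locale, "POSIX") == 0 ||
           strcmp(locale, "C.UTF-8") == 0;
}

/* Hypothetical: C.UTF-8 is still a real ctype, so it is not "C" here. */
static bool
sketch_lc_ctype_is_c(const char *locale)
{
    return strcmp(locale, "C") == 0 ||
           strcmp(locale, "POSIX") == 0;
}

/*
 * Hypothetical: anything compared internally never reaches an external
 * provider, so it carries no version (NULL).
 */
static const char *
sketch_collation_actual_version(const char *locale)
{
    if (sketch_lc_collate_is_c(locale))
        return NULL;

    return "provider-version";  /* placeholder, not a real version */
}
```

The special case for one locale name in only one of the two predicates
is the part that reads as "too magical".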

Another alternative would be to implement C.UTF-8 internally according
to the "true" semantics, if they are truly simple and well-defined and
stable. But I don't think ctype=C.UTF-8 is actually stable because new
characters can be added, right?

Regards,
Jeff Davis
