Re: Built-in CTYPE provider

From: Peter Eisentraut <peter(at)eisentraut(dot)org>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-03-12 08:24:14
Message-ID: 19b34a70-f5cf-4faf-88dc-917db44ed48d@eisentraut.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08.03.24 02:00, Jeff Davis wrote:
> And here's v22 (I didn't post v21).
>
> I committed Unicode property tables and functions, and the simple case
> mapping. I separated out the full case mapping changes (based on
> SpecialCasing.txt) into patch 0006.

> 0002: Basic builtin collation provider that only supports "C".

Overall, this patch looks sound.

In the documentation, let's make the list of locale providers an actual
list instead of a sequence of <sect3>s.

We had some discussion on initdb option --builtin-locale and whether it
should be something more general. I'm ok with leaving it like this for
now and maybe consider as an "open item" for PG17.

In

errmsg("parameter \"locale\" must be specified")

make "locale" a placeholder. (See commit 36a14afc076).

It seems the builtin provider accepts both "C" and "POSIX" as locale
names, but the documentation says it must be "C". Maybe we don't need
to accept "POSIX"? (Seeing that there are no plans for "POSIX.UTF-8",
maybe we just ignore the "POSIX" spelling altogether?)

Speaking of which, the code in postinit.c is inconsistent in that
respect with builtin_validate_locale(). Shouldn't postinit.c use
builtin_validate_locale(), to keep it consistent?

Or, there could be a general function that accepts a locale provider and
a locale string and validates everything together?

In initdb.c, this message

printf(_("The database cluster will be initialized with no locale.\n"));

sounds a bit confusing. I think it's ok to show "C" as a locale. I'm
not sure we need to change the logic here.

Also in initdb.c, this message

pg_fatal("locale must be specified unless provider is libc");

should be flipped around, like

locale must be specified if provider is %s

In pg_dump.c, dumpDatabase(), there are some new warning messages that
are not specifically about the builtin provider. Are those existing
deficiencies? It's not clear to me.

What are the changes in the pg_upgrade test about? Maybe explain the
scenario it is trying to test briefly?

> 0004: Inline some UTF-8 functions to improve performance

Makes sense that inlining can be effective here. But why aren't you
just inlining the existing function pg_utf_mblen()? Now we have two
functions that do the same thing. And the comment at pg_utf_mblen() is
removed completely, so it's not clear anymore why it exists.

> 0005: Add a unicode_strtitle() function and move the implementation for
> the builtin provider out of formatting.c.

In the recent discussion you had expression some uncertainty about the
detailed semantics of this. INITCAP() was copied from Oracle, so we
could check there for reference, too. Or we go with full Unicode
semantics. I'm not clear on all the differences and tradeoffs, if there
are any. In any case, it would be good if there were documentation or a
comment that somehow wrote down the resolution of this.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey M. Borodin 2024-03-12 08:28:19 Re: Transaction timeout
Previous Message Alexander Lakhin 2024-03-12 08:00:00 Re: DROP DATABASE is interruptible