Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-03-14 20:42:20
Message-ID: 7451f81ba0cb512222ab759de8ca1cffe44e9acb.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2024-03-14 at 09:54 +0100, Peter Eisentraut wrote:
> * doc/src/sgml/charset.sgml
>
> I don't understand the purpose of this sentence:
>
> "When using this locale, the behavior may depend on the database
> encoding."

The "C" locale (in either the builtin or libc provider) can sort
differently in different encodings, because it's based on memcmp. For
instance:

select U&'\20AC' > U&'\201A' collate "C";

Returns true in UTF-8 and false in WIN1252. That's why UCS_BASIC is
only available in UTF-8, because (at least for some encodings) we'd
have to decode before comparison to get the code-point-order semantics
right.

In other words, the "C" collation is not a well-defined order, but
UCS_BASIC and C.UTF-8 are well-defined.

Suggestions for better wording are welcome.

> * doc/src/sgml/ref/create_database.sgml
>
> The new parameter builtin_locale is not documented.

Thank you, fixed in 0001 (review fixup).

> * src/backend/commands/collationcmds.c
>
> I think DefineCollation() should set collencoding = -1 for the
> COLLPROVIDER_BUILTIN case.  -1 stands for any encoding.  Or at least
> explain why not?

In the attached v25-0001 (review fixup) I have made it the
responsibility of a function, and then extended that for the C.UTF-8
(0002) and PG_UNICODE_FAST locales (0007).

> * src/backend/utils/adt/pg_locale.c
>
> This part is a bit confusing:
>
> +           cache_entry->collate_is_c = true;
> +           cache_entry->ctype_is_c = (strcmp(colllocale, "C") == 0);
>
> Is collate always C but ctype only sometimes?  Does this anticipate
> future patches in this series?  Maybe in this patch it should always
> be true?

Made it a constant in v25-0001, and changed it in 0002

>
> * src/bin/initdb/initdb.c
>
> +   printf(_("      --builtin-locale=LOCALE   set builtin locale name
> for new databases\n"));
>
> Put in a line break so that the right "column" lines up.

Fixed in 0001

> This output should line up better:
>
> The database cluster will be initialized with this locale
> configuration:
>    default collation provider:  icu
>    default collation locale:    en
>    LC_COLLATE:  C
>    LC_CTYPE:    C
>    ...
>
> Also, why are there two spaces after "provider:  "?
>
> Also we call these locale provider on input, why are they collation
> providers on output?  What is a "collation locale"?

I tried to fix these things in 0001.

> * src/bin/pg_upgrade/t/002_pg_upgrade.pl
>
> +if ($oldnode->pg_version >= '17devel')
>
> This is weird.  >= is a numeric comparison, so providing a string
> with
> non-digits is misleading at best.

It's actually not a numeric comparison, it's an overloaded comparison
op for the Version class.

See 32dd2c1eff and:
https://www.postgresql.org/message-id/1738174.1710274577%40sss.pgh.pa.us

> * src/test/icu/t/010_database.pl
>
> -# Test that LOCALE works for ICU locales if LC_COLLATE and LC_CTYPE
> -# are specified
>
> Why remove this test?

It must have been lost during a rebase, fixed in 0001.

> Change the name of the new database to be different from the name of
> the template database.

Fixed in 0001.

New series attached.

Regards,
Jeff Davis

Attachment Content-Type Size
v25-0001-Address-more-review-comments-on-commit-2d819a08a.patch text/x-patch 7.9 KB
v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch text/x-patch 29.8 KB
v25-0003-Inline-basic-UTF-8-functions.patch text/x-patch 4.4 KB
v25-0004-Use-version-for-builtin-collations.patch text/x-patch 1.8 KB
v25-0005-Add-unicode_strtitle-for-Unicode-Default-Case-Co.patch text/x-patch 9.4 KB
v25-0006-Support-Unicode-full-case-mapping-and-conversion.patch text/x-patch 555.7 KB
v25-0007-Support-PG_UNICODE_FAST-locale-in-the-builtin-co.patch text/x-patch 18.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2024-03-14 20:42:28 Re: Built-in CTYPE provider
Previous Message David Steele 2024-03-14 20:40:38 Re: Add basic tests for the low-level backup method.