Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2023-12-28 01:26:35
Message-ID: 6b1370d5eaba5e8c42f54c05f7bc2b8e27b8db12.camel@j-davis.com
Lists: pgsql-hackers

On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
>
> But C.UTF-8 is not available everywhere, and there's still the
> problem that Unicode updates through libc are not aligned
> with Postgres releases.

Attached is an implementation of a built-in provider for the "C.UTF-8"
locale. That way applications (and tests!) can count on C.UTF-8 always
being available on any platform; and it also aligns with the Postgres
Unicode updates. Documentation is sparse and the patch is a bit rough,
but feedback is welcome -- it does have some basic tests which can be
used as a guide.

The C.UTF-8 locale, briefly, is a UTF-8 locale that provides simple
collation semantics (code point order) but rich ctype semantics
(lower/upper/initcap and regexes). This locale is for users who want
proper Unicode semantics for character operations (upper/lower,
regexes), but don't need a specific natural-language string sort order
to apply to all queries and indexes in their system. One might use it
as the database default collation, and use COLLATE clauses (e.g.
COLLATE UNICODE) where more specific behavior is needed.
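
As a rough illustration of that split (simple collation, rich ctype), here is a small Python sketch. It is not the patch's code: Python's own Unicode tables stand in for the Postgres ones, so edge cases may differ (for instance, Python applies full case mappings such as "ß" to "SS"). The helper ascii_upper is hypothetical and only there for contrast with ASCII-only behavior.

```python
import re

# "\u00e9" is e-acute (lowercase), "\u00c9" is E-acute (uppercase).
words = ["banana", "Zebra", "\u00e9clair", "apple", "\u00c9clair"]

# Collation: plain code point order, no language-specific rules.
# Python compares str values by code point, so sorted() models this.
by_code_point = sorted(words)

# ctype: case mapping and regex classes understand non-ASCII characters.
upper_unicode = "\u00e9clair".upper()
word_match_unicode = re.fullmatch(r"\w+", "\u00e9clair") is not None


def ascii_upper(s: str) -> str:
    """Hypothetical ASCII-only upper(): maps a-z and leaves the rest alone."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


# For contrast: ASCII-only case mapping and ASCII-only \w.
upper_ascii = ascii_upper("\u00e9clair")
word_match_ascii = re.fullmatch(r"\w+", "\u00e9clair", re.ASCII) is not None

print(by_code_point)
print(upper_unicode, upper_ascii)
print(word_match_unicode, word_match_ascii)
```

Note that code point order puts "Zebra" before "apple" and both accented words last, which is simple and stable but not what a natural-language sort would produce; that is where a COLLATE clause comes in.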

The builtin C.UTF-8 locale has the following advantages over using the
libc C.UTF-8 locale:

* Collation performance: the builtin provider uses memcmp and
abbreviated keys. With the libc provider, these optimizations are only
available for the C locale.

* Unicode version is aligned with other parts of Postgres, like
normalization.

* Available on all platforms with exactly the same semantics.

* Testable and documentable.

* Avoids index corruption risks. In theory libc C.UTF-8 should also
have stable collation, but that is not 100% true. In the builtin
provider it is 100% stable.
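
To make the performance point concrete: UTF-8 is designed so that byte-wise comparison agrees with code point order, which is why a memcmp-style comparison is enough. The Python sketch below checks that property and mimics an abbreviated key by packing the first 8 bytes into an integer. The names abbrev_key and compare are hypothetical, the 8-byte width is an assumption, and none of this is taken from the patch's C code.

```python
from functools import cmp_to_key


def abbrev_key(s: str, width: int = 8) -> int:
    """Hypothetical abbreviated key: first `width` UTF-8 bytes as a big-endian int."""
    prefix = s.encode("utf-8")[:width]
    return int.from_bytes(prefix.ljust(width, b"\0"), "big")


def compare(a: str, b: str) -> int:
    """Compare abbreviated keys first; fall back to full byte comparison on a tie."""
    ka, kb = abbrev_key(a), abbrev_key(b)
    if ka != kb:
        return -1 if ka < kb else 1
    ba, bb = a.encode("utf-8"), b.encode("utf-8")
    return (ba > bb) - (ba < bb)  # memcmp-like result: -1, 0, or 1


words = [
    "zebra", "Zebra", "\u00e9clair", "apple", "Apple",
    "\U0001F600",            # supplementary-plane character (4 UTF-8 bytes)
    "\u65e5\u672c\u8a9e",    # three CJK characters (3 UTF-8 bytes each)
    "abcdefgh2", "abcdefgh1",  # identical first 8 bytes, forces the fallback
]

by_code_point = sorted(words)  # Python str comparison is by code point
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_abbrev = sorted(words, key=cmp_to_key(compare))

print(by_code_point == by_utf8_bytes == by_abbrev)
```

The integer comparison settles most pairs without touching the full strings; only values sharing the same 8-byte prefix need the full comparison.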

Regards,
Jeff Davis

Attachment | Content-Type | Size
v14-0001-Minor-cleanup-for-unicode-update-build-and-test.patch | text/x-patch | 7.4 KB
v14-0002-Add-Unicode-property-tables.patch | text/x-patch | 91.4 KB
v14-0003-Add-unicode-case-mapping-tables-and-functions.patch | text/x-patch | 140.3 KB
v14-0004-Catalog-changes-preparing-for-builtin-collation-.patch | text/x-patch | 46.3 KB
v14-0005-Introduce-collation-provider-builtin-for-C-and-C.patch | text/x-patch | 63.1 KB
