Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-02-27 03:01:37
Message-ID: 4a69d067374d2f6bfb66f5bfb2ab9a020493d49f.camel@j-davis.com
Lists: pgsql-hackers

On Tue, 2024-02-13 at 07:24 +0100, Peter Eisentraut wrote:
> It is my understanding that "correct" Unicode case conversion needs
> to use at least some parts of SpecialCasing.txt.
...
> I think we need to use the "Unconditional" mappings and the
> "Conditional Language-Insensitive" mappings (which is just Greek
> sigma). Obviously, skip the "Language-Sensitive" mappings.

Attached is a new patch series.

Overall I'm quite happy with this feature as well as the recent
updates. It expands a lot on what behavior we can actually document;
the character semantics are nearly as good as ICU; it's fast; and it
eliminates what is arguably the last reason to use libc ("C collation
combined with some other CTYPE").

Changes:

* Added a doc update for the "standard collations" (tiny patch, mostly
separate) which clarifies the collations that are always available, and
describes them a bit better

* Added built-in locale "UCS_BASIC" (is that name confusing?) which
uses full case mapping and the standard properties:
- "ß" uppercases to "SS"
- "Σ" usually lowercases to "σ", except when the Final_Sigma
condition is met, in which case it lowercases to "ς"
- initcap() uses titlecase variants ("dž" changes to "Dž")
- in patterns/regexes, symbols (like "=") are not treated as
punctuation

* Changed the UCS_BASIC collation to use the builtin "UCS_BASIC"
locale with Unicode semantics. At first I was skeptical because it's a
behavior change, and I am still not sure we want to do that. But doing
so would take us closer to both the SQL spec and Unicode; also, this
kind of character behavior change is less likely to cause a problem
than a collation behavior change.

* The built-in locale "C.UTF-8" still exists, which uses Unicode
simple case mapping and the POSIX compatible properties (no change
here).
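
To make the full case mapping behaviors listed above concrete, here is a
small sketch in Python. It uses Python's own str methods, which follow the
Unicode full case mappings, purely as a stand-in; it is not the patch's C
code and says nothing about how the builtin provider is implemented. The
variable names are made up for illustration.

```python
import unicodedata

# Full case mapping may change the length of the string.
upper_sharp_s = "ß".upper()          # one code point becomes two: "SS"

# Final_Sigma: capital sigma at the end of a word lowercases to "ς",
# otherwise it lowercases to "σ".
lower_word = "ΟΔΟΣ".lower()          # last letter becomes final sigma
lower_lone = "Σ".lower()             # no preceding cased letter: "σ"

# Titlecase is distinct from uppercase for digraphs such as U+01C6 "dž".
title_dz = "\u01c6".title()          # U+01C5 "Dž"
upper_dz = "\u01c6".upper()          # U+01C4 "DŽ"

# In Unicode, "=" has general category Sm (math symbol), while "!" is
# Po (punctuation). POSIX ispunct() would accept both.
category_equals = unicodedata.category("=")
category_bang = unicodedata.category("!")
```

With simple case mapping (as in "C.UTF-8"), each code point maps to exactly
one code point, so "ß" would stay "ß" when uppercased.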

Implementation-wise:

* I introduced the CaseKind enum, which seemed to clean up a few
things and reduce code duplication between upper/lower/titlecase. It
also leaves room for introducing case folding later.

* Introduced a "case-ignorable" table to properly implement the
Final_Sigma rule.
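
For readers unfamiliar with Final_Sigma, the following is a rough Python
sketch of why a case-ignorable table is needed. The helper names and the
tiny hand-written "table" are hypothetical and heavily abbreviated; the
real tables in the patch are generated from the Unicode data files and
this is not a transcription of the C code.

```python
import unicodedata

# Hypothetical, abbreviated stand-in for a generated case-ignorable table.
_EXTRA_CASE_IGNORABLE = {"'", ".", ":", "\u00b7", "\u2019"}
_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}


def is_case_ignorable(ch):
    """Approximate the Case_Ignorable property."""
    return (ch in _EXTRA_CASE_IGNORABLE
            or unicodedata.category(ch) in _IGNORABLE_CATEGORIES)


def is_cased(ch):
    """Approximate the Cased property."""
    return ch.isupper() or ch.islower() or unicodedata.category(ch) == "Lt"


def final_sigma(text, i):
    """True if the sigma at text[i] meets the Final_Sigma condition."""
    # Look backward past case-ignorable characters for a cased letter.
    j = i - 1
    while j >= 0 and is_case_ignorable(text[j]):
        j -= 1
    if j < 0 or not is_cased(text[j]):
        return False
    # Look forward past case-ignorable characters; no cased letter may follow.
    k = i + 1
    while k < len(text) and is_case_ignorable(text[k]):
        k += 1
    return k == len(text) or not is_cased(text[k])


def lowercase(text):
    """Lowercase text, choosing between the two lowercase sigmas."""
    out = []
    for i, ch in enumerate(text):
        if ch == "\u03a3":  # GREEK CAPITAL LETTER SIGMA
            out.append("\u03c2" if final_sigma(text, i) else "\u03c3")
        else:
            out.append(ch.lower())
    return "".join(out)
```

Without the case-ignorable check, a trailing period or apostrophe after
the sigma would hide the end of the word and the wrong form would be
chosen.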

Loose ends:

* Right now you can't mix all of the full case mapping behavior with
INITCAP(); it just does simple titlecase mapping. I'm not sure we want
to get too fancy here; after all, INITCAP() is not a SQL standard
function and it's documented in a narrow fashion that doesn't seem to
leave a lot of room to be very smart. ICU does a few extra things
beyond what I did:
- its case conversion function accepts a word break iterator
- it provides some built-in word break iterators
- it also has some configurable "break adjustment" behavior[1][2]
which re-aligns the start of the word, and I'm not entirely sure why
that isn't done in the word break iterator or the titlecasing rules
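
As an illustration of the narrow INITCAP() behavior discussed above, here
is a Python sketch. It assumes the documented definition (words are runs
of alphanumeric characters) and approximates "simple" mappings by keeping
the original character whenever Python's full mapping would yield more
than one code point. The function names are hypothetical, and this is not
the patch's implementation.

```python
def _simple(mapped, ch):
    # Approximation of a simple (one-to-one) mapping: if the full mapping
    # expands to several code points, keep the original character. A real
    # simple-mapping table may map such characters differently.
    return mapped if len(mapped) == 1 else ch


def initcap(text):
    """Titlecase the first character of each alphanumeric run and
    lowercase the rest, with no word break iterator."""
    out = []
    in_word = False
    for ch in text:
        if ch.isalnum():
            if in_word:
                out.append(_simple(ch.lower(), ch))
            else:
                out.append(_simple(ch.title(), ch))
            in_word = True
        else:
            # Any non-alphanumeric character ends the current word.
            out.append(ch)
            in_word = False
    return "".join(out)
```

Under these assumptions, initcap("o'neil") gives "O'Neil", because the
apostrophe ends the word. Cases like that are where a real word break
iterator, such as the ones ICU provides, would behave differently.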

Regards,
Jeff Davis

[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#a4975f537b9960f0330b233061ef0608d
[2]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#afc65fa226cac9b8eeef0e877b8a7744e

Attachment Content-Type Size
v18-0001-Documentation-update-for-Standard-Collations.patch text/x-patch 5.0 KB
v18-0002-Add-Unicode-property-tables.patch text/x-patch 105.9 KB
v18-0003-Add-unicode-case-mapping-tables-and-functions.patch text/x-patch 311.6 KB
v18-0004-Catalog-changes-preparing-for-builtin-collation-.patch text/x-patch 48.5 KB
v18-0005-Introduce-collation-provider-builtin.patch text/x-patch 85.7 KB
v18-0006-Change-collation-UCS_BASIC-to-use-Unicode-semant.patch text/x-patch 11.0 KB
