
Re: Built-in CTYPE provider

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Built-in CTYPE provider
Date:	2023-12-29 02:57:16
Message-ID:	804eb67b37f41d3afeb2b6469cbe8bfa79c562cc.camel@j-davis.com
Lists:	pgsql-hackers

On Wed, 2023-12-27 at 17:26 -0800, Jeff Davis wrote:
> Attached is an implementation of a built-in provider for the "C.UTF-8"

Attached is a more complete version that fixes a few bugs, stabilizes the
tests, and improves the documentation. I optimized the performance, too
-- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both
collation and case mapping (numbers below).

It's really nice to finally be able to have platform-independent tests
that work on any UTF-8 database.

Simple character classification:

SELECT 'Á' ~ '[[:alpha:]]' COLLATE C_UTF8;

Case mapping is more interesting (note that accented characters are
being properly mapped, and it's using the titlecase variant "ǅ"):

SELECT initcap('axxE áxxÉ ǄxxǄ ǅxxx ǆxxx' COLLATE C_UTF8);
initcap
--------------------------
Axxe Áxxé ǅxxǆ ǅxxx ǅxxx
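
As a cross-check on the titlecase behavior shown above, here is a minimal Python sketch. It is not the patch's code: it assumes only that Python's str.title() applies the Unicode titlecase mapping to the first cased letter of each word and the lowercase mapping to the rest. The helper name initcap_sketch is made up for illustration, and Python uses full rather than simple case mappings, so other inputs may behave differently.

```python
# Illustration only: Python's str.title(), not the C_UTF8 implementation.
# The digraph code points are written as escapes so they are unambiguous.
DZ_UPPER = "\u01c4"  # LATIN CAPITAL LETTER DZ WITH CARON
DZ_TITLE = "\u01c5"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
DZ_LOWER = "\u01c6"  # LATIN SMALL LETTER DZ WITH CARON


def initcap_sketch(text: str) -> str:
    """Hypothetical stand-in for initcap(): titlecase the first letter of
    each word and lowercase the rest, using Python's own Unicode tables."""
    return text.title()


# Same input as the initcap() example: 'axxE áxxÉ ǄxxǄ ǅxxx ǆxxx'
sample = f"axxE \u00e1xx\u00c9 {DZ_UPPER}xx{DZ_UPPER} {DZ_TITLE}xxx {DZ_LOWER}xxx"
result = initcap_sketch(sample)
print(result)
```

With current Unicode data this should print the same string as the initcap() output above, because the uppercase, titlecase, and lowercase digraph forms are three distinct code points that map to one another.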

Even more interesting -- test that non-Latin characters can still be
members of a case-insensitive range:

-- capital delta is member of lowercase range gamma to lambda
SELECT 'Δ' ~* '[γ-λ]' COLLATE C_UTF8;
-- small delta is member of uppercase range gamma to lambda
SELECT 'δ' ~* '[Γ-Λ]' COLLATE C_UTF8;
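
The semantics these two queries exercise can be sketched in Python. This is an assumption-laden illustration, not how the PostgreSQL regex engine is implemented: the hypothetical helper in_range_ignoring_case treats a character as a member of a range if the character itself or its upper/lowercase counterpart falls inside the range by code point.

```python
def in_range_ignoring_case(ch: str, low: str, high: str) -> bool:
    """Return True if ch, or its single-code-point upper/lowercase
    counterpart, falls inside the code point range [low, high]."""
    candidates = {ch, ch.lower(), ch.upper()}
    return any(len(c) == 1 and low <= c <= high for c in candidates)


# Capital delta (U+0394) against the lowercase range gamma..lambda.
capital_in_lower_range = in_range_ignoring_case("\u0394", "\u03b3", "\u03bb")
# Small delta (U+03B4) against the uppercase range Gamma..Lambda.
small_in_upper_range = in_range_ignoring_case("\u03b4", "\u0393", "\u039b")
# Without case mapping, capital delta lies outside the lowercase range.
plain_comparison = "\u03b3" <= "\u0394" <= "\u03bb"

print(capital_in_lower_range, small_in_upper_range, plain_comparison)
```

The last comparison shows why the test is meaningful: the match only succeeds if case mapping works for Greek letters, which a plain "C" locale limited to ASCII would not provide.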

Moreover, a lot of this behavior is locked in by strong Unicode
guarantees like [1] and [2]. Behavior that can change probably won't
change very often, and in any case will be tied to a PG major version.

All of these behaviors are very close to what glibc "C.utf8" does on my
machine. The case transformations are identical (except titlecasing
because libc doesn't support it). The character classifications have
some differences, which might be worth discussing, but I didn't see
anything terribly concerning (I am following the Unicode
recommendations[3] on this topic).

Performance:

Sorting 10M strings:
libc "C"             14s
builtin C_UTF8       14s
libc "C.utf8"        20s
ICU "en-US-x-icu"    31s

Running UPPER() on 10M strings:
libc "C"             03s
builtin C_UTF8       07s
libc "C.utf8"        08s
ICU "en-US-x-icu"    15s

I didn't investigate or optimize regexes / pattern matching yet, but I
can do similar optimizations if there's any gap.

Note that I implemented the "simple" case mapping (which is what glibc
does) and the "POSIX compatible"[3] flavor of character classification
(which is closer to what glibc does than the "standard" flavor). I
opted to use title case mapping for initcap(), which is a difference
from libc and I may go back to just upper/lower. These seem like
reasonable choices if we're going to name the locale after C.UTF-8.
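
To make the "simple" versus full distinction concrete: a simple case mapping always maps one code point to exactly one code point, while a full mapping may expand a character into several. The Python sketch below uses str.upper(), which applies full mappings, to show which characters expand. It only illustrates the concept and says nothing about what the patch returns; the helper name full_uppercase_expands is hypothetical.

```python
def full_uppercase_expands(ch: str) -> bool:
    """True if Python's full uppercase mapping turns one code point into several."""
    return len(ch.upper()) > 1


# U+00E1 (a with acute), U+00DF (sharp s), U+FB01 (fi ligature)
for ch in ["\u00e1", "\u00df", "\ufb01"]:
    print(f"U+{ord(ch):04X} -> {ch.upper()!r} expands={full_uppercase_expands(ch)}")
```

Characters that expand here, such as U+00DF, are the ones where a one-to-one simple mapping has to behave differently, since it cannot change the length of the string in code points.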

Regards,
Jeff Davis

[1] https://www.unicode.org/policies/stability_policy.html#Case_Pair
[2] https://www.unicode.org/policies/stability_policy.html#Identity
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties

Attachment	Content-Type	Size
v15-0001-Minor-cleanup-for-unicode-update-build-and-test.patch	text/x-patch	7.4 KB
v15-0002-Add-Unicode-property-tables.patch	text/x-patch	91.5 KB
v15-0003-Add-unicode-case-mapping-tables-and-functions.patch	text/x-patch	144.8 KB
v15-0004-Catalog-changes-preparing-for-builtin-collation-.patch	text/x-patch	46.2 KB
v15-0005-Introduce-collation-provider-builtin-for-C-and-C.patch	text/x-patch	73.0 KB

In response to

Re: Built-in CTYPE provider at 2023-12-28 01:26:35 from Jeff Davis

Responses

Re: Built-in CTYPE provider at 2024-01-09 01:17:48 from Jeremy Schneider
Re: Built-in CTYPE provider at 2024-01-09 22:17:44 from Jeremy Schneider
Re: Built-in CTYPE provider at 2024-01-10 22:56:23 from Daniel Verite
