Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2023-12-29 02:57:16
Message-ID: 804eb67b37f41d3afeb2b6469cbe8bfa79c562cc.camel@j-davis.com
Lists: pgsql-hackers

On Wed, 2023-12-27 at 17:26 -0800, Jeff Davis wrote:
> Attached is an implementation of a built-in provider for the "C.UTF-8"

Attached is a more complete version that fixes a few bugs, stabilizes the
tests, and improves the documentation. I optimized the performance, too
-- it now beats both libc's "C.utf8" and ICU "en-US-x-icu" for
collation and case mapping (numbers below).

It's really nice to finally be able to have platform-independent tests
that work on any UTF-8 database.

Simple character classification:

SELECT 'Á' ~ '[[:alpha:]]' COLLATE C_UTF8;

Case mapping is more interesting (note that accented characters are
being properly mapped, and it's using the titlecase variant "Dž"):

SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx' COLLATE C_UTF8);
initcap
--------------------------
Axxe Áxxé Džxxdž Džxxx Džxxx
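
To make the titlecase point concrete outside the database, here is a
minimal sketch in Python, whose built-in string methods apply the
Unicode case mappings. It is an illustration only, not the C code in
the attached patches; the helper names initcap_title and initcap_upper
are hypothetical, and splitting words on single spaces is a
simplification of what initcap() does.

```python
# Illustration only: Python's built-in Unicode case mappings,
# not the C code from the attached patches.

def initcap_title(text):
    """Titlecase the first character of each space-separated word, lowercase the rest."""
    return " ".join(w[:1].title() + w[1:].lower() for w in text.split(" "))

def initcap_upper(text):
    """Same, but uppercase the first character (no titlecase variant)."""
    return " ".join(w[:1].upper() + w[1:].lower() for w in text.split(" "))

sample = "\u01c6xxx"  # word starting with U+01C6 (lowercase digraph dz with caron)
print(initcap_title(sample))  # first character becomes U+01C5 (titlecase form)
print(initcap_upper(sample))  # first character becomes U+01C4 (uppercase form)
```

The two helpers differ only for characters that have a distinct
titlecase form, such as the U+01C4/U+01C5/U+01C6 digraph family used
in the example above.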

Even more interesting -- test that non-Latin characters can still be
members of a case-insensitive range:

-- capital delta is member of lowercase range gamma to lambda
SELECT 'Δ' ~* '[γ-λ]' COLLATE C_UTF8;
-- small delta is member of uppercase range gamma to lambda
SELECT 'δ' ~* '[Γ-Λ]' COLLATE C_UTF8;
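
As a rough sketch of why both queries should match, the check below
tests a character and its case counterparts against a code point
range. It is written in Python for illustration; the function name
in_range_ci is hypothetical, and this is not a description of how the
PostgreSQL regex engine implements case-insensitive ranges.

```python
# Illustration only: a naive case-insensitive range test on code points.

def in_range_ci(ch, lo, hi):
    """Return True if ch, or its lower/upper case counterpart, lies in [lo, hi]."""
    candidates = {ch, ch.lower(), ch.upper()}
    # Skip multi-character results of full case mapping; only
    # single code points can be range members.
    return any(lo <= c <= hi for c in candidates if len(c) == 1)

# capital delta against the lowercase range gamma to lambda
print(in_range_ci("\u0394", "\u03b3", "\u03bb"))
# small delta against the uppercase range gamma to lambda
print(in_range_ci("\u03b4", "\u0393", "\u039b"))
```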

Moreover, a lot of this behavior is locked in by strong Unicode
guarantees like [1] and [2]. Behavior that can change probably won't
change very often, and in any case will be tied to a PG major version.

All of these behaviors are very close to what glibc "C.utf8" does on my
machine. The case transformations are identical (except titlecasing
because libc doesn't support it). The character classifications have
some differences, which might be worth discussing, but I didn't see
anything terribly concerning (I am following the Unicode
recommendations[3] on this topic).

Performance:

Sorting 10M strings:
libc "C"           14s
builtin C_UTF8     14s
libc "C.utf8"      20s
ICU "en-US-x-icu"  31s

Running UPPER() on 10M strings:
libc "C"           03s
builtin C_UTF8     07s
libc "C.utf8"      08s
ICU "en-US-x-icu"  15s

I didn't investigate or optimize regexes / pattern matching yet, but I
can do similar optimizations if there's any gap.

Note that I implemented the "simple" case mapping (which is what glibc
does) and the "POSIX compatible"[3] flavor of character classification
(which is closer to what glibc does than the "standard" flavor). I
opted to use title case mapping for initcap(), which differs from
libc, and I may go back to just upper/lower. These seem like
reasonable choices if we're going to name the locale after C.UTF-8.

Regards,
Jeff Davis

[1] https://www.unicode.org/policies/stability_policy.html#Case_Pair
[2] https://www.unicode.org/policies/stability_policy.html#Identity
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties

Attachment Content-Type Size
v15-0001-Minor-cleanup-for-unicode-update-build-and-test.patch text/x-patch 7.4 KB
v15-0002-Add-Unicode-property-tables.patch text/x-patch 91.5 KB
v15-0003-Add-unicode-case-mapping-tables-and-functions.patch text/x-patch 144.8 KB
v15-0004-Catalog-changes-preparing-for-builtin-collation-.patch text/x-patch 46.2 KB
v15-0005-Introduce-collation-provider-builtin-for-C-and-C.patch text/x-patch 73.0 KB
