Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-03-02 23:02:00
Message-ID: 163f4e2190cdf67f67016044e503c5004547e5a9.camel@j-davis.com
Lists: pgsql-hackers

On Thu, 2024-02-29 at 21:05 -0800, Jeff Davis wrote:
> Attached v19 which addresses this issue.

I pushed the doc patch.

Attached v20. I am going to start pushing some other patches. v20-0001
(property tables) and v20-0003 (catalog iculocale -> locale) have been
stable for a while so are likely to go in soon. v20-0002 (case mapping)
also feels close to me, but it went through significant changes to
support full case mapping and titlecasing, so I'll see if there are
more comments.

Changes in v20:

* For titlecasing with the builtin "C.UTF-8" locale, do not perform
word break adjustment, so it matches libc's "C.UTF-8" titlecasing
behavior more closely.

* Add optimized table for ASCII code points when determining
categories and properties (this was already done for the case mapping
table).

* Add a small patch to make UTF-8 functions inline, which speeds
things up substantially.
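
As a rough illustration of the ASCII fast path described above (not code from the patch): a lookup can answer code points below 128 from a small direct-indexed array and fall back to a binary search over sorted ranges for everything else. All names here (PROP_ALPHA, PROP_DIGIT, prop_range, lookup_props) are hypothetical, the range table is a tiny hand-picked excerpt rather than real generated Unicode data, and the ASCII array is filled lazily only to keep the sketch short; a generated table would more likely be a static const array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical property bits; not PostgreSQL's actual definitions. */
#define PROP_ALPHA 0x01
#define PROP_DIGIT 0x02

/* Hypothetical range entry for code points >= 128. */
typedef struct
{
    uint32_t first;
    uint32_t last;
    uint8_t  props;
} prop_range;

/* Illustrative excerpt only; sorted and non-overlapping. */
static const prop_range nonascii_ranges[] = {
    {0x00C0, 0x00D6, PROP_ALPHA},   /* Latin-1 letters */
    {0x00D8, 0x00F6, PROP_ALPHA},   /* Latin-1 letters */
    {0x0391, 0x03A1, PROP_ALPHA},   /* Greek capital letters */
    {0x0660, 0x0669, PROP_DIGIT},   /* Arabic-Indic digits */
};

/* Direct-indexed table for ASCII, filled on first use. */
static uint8_t ascii_props[128];
static bool ascii_props_ready = false;

static void
init_ascii_props(void)
{
    for (int c = 0; c < 128; c++)
    {
        uint8_t p = 0;

        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
            p |= PROP_ALPHA;
        if (c >= '0' && c <= '9')
            p |= PROP_DIGIT;
        ascii_props[c] = p;
    }
    ascii_props_ready = true;
}

static uint8_t
lookup_props(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof(nonascii_ranges) / sizeof(nonascii_ranges[0]);

    /* Fast path: one array access, no search. */
    if (cp < 128)
    {
        if (!ascii_props_ready)
            init_ascii_props();
        return ascii_props[cp];
    }

    /* Slow path: binary search over the sorted ranges. */
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (cp < nonascii_ranges[mid].first)
            hi = mid;
        else if (cp > nonascii_ranges[mid].last)
            lo = mid + 1;
        else
            return nonascii_ranges[mid].props;
    }
    return 0;                   /* not covered by this excerpt */
}
```

The point of the split is that ASCII-heavy text never pays for the search, which is the same idea already applied to the case mapping table.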

Performance:

ASCII-only data:

                     lower initcap   upper
"C" (libc)            2426    3326    2341
pg_c_utf8             2890    6570    2825
pg_unicode_fast       2929    7140    2893
"C.utf8" (libc)       5410    7810    5397
"en-US-x-icu"         8320   65732    9367

Including non-ASCII data:

                     lower initcap   upper
"C" (libc)            2630    4677    2548
pg_c_utf8             5471   10682    5431
pg_unicode_fast       5582   12023    5587
"C.utf8" (libc)       8126   11834    8106
"en-US-x-icu"        14473   73655   15112

The new builtin collations finish ahead of everything except libc "C".
The one exception is titlecasing non-ASCII data, where pg_unicode_fast
is marginally slower than libc "C.utf8" (12023 vs. 11834), which is
likely due to the word break adjustment semantics.
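
For readers unfamiliar with the simpler titlecasing rule, here is a minimal sketch, not taken from the patch. It follows the rule documented for initcap(): words are runs of alphanumeric characters separated by non-alphanumeric characters, the first character of each word is uppercased and the rest are lowercased. The function name sketch_initcap_ascii is hypothetical, and the sketch only maps ASCII: bytes at or above 128 are assumed to be part of a word and are copied through unchanged.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Titlecase a NUL-terminated string in place, ASCII only.
 * A word boundary is any transition from a non-alphanumeric
 * character to an alphanumeric one; no other adjustment is made.
 */
static void
sketch_initcap_ascii(char *s)
{
    bool in_word = false;

    for (; *s != '\0'; s++)
    {
        unsigned char c = (unsigned char) *s;

        if (c >= 128)
        {
            /* Assumption: treat non-ASCII bytes as word characters. */
            in_word = true;
        }
        else if (isalnum(c))
        {
            /* First character of a word goes up, the rest go down. */
            *s = (char) (in_word ? tolower(c) : toupper(c));
            in_word = true;
        }
        else
            in_word = false;
    }
}
```

Because the boundary test is a single per-character check, this style of titlecasing does no lookahead or rule matching across neighbouring characters, which is where extra word break handling would add cost.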

I suspect the inlined UTF-8 functions also speed up a few other areas,
but I didn't measure.
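
To show why inlining matters for this kind of code, here is a sketch of small UTF-8 helpers declared static inline so a compiler can fold them into a per-character loop instead of making a function call for every code point. This is not the code in v20-0006; the names sketch_utf8_len, sketch_utf8_decode and sketch_count_code_points are hypothetical, and the helpers assume well-formed, NUL-terminated UTF-8 and do no validation.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte length of a UTF-8 sequence, judged from its first byte. */
static inline int
sketch_utf8_len(const unsigned char *s)
{
    if ((*s & 0x80) == 0)
        return 1;
    if ((*s & 0xe0) == 0xc0)
        return 2;
    if ((*s & 0xf0) == 0xe0)
        return 3;
    if ((*s & 0xf8) == 0xf0)
        return 4;
    return 1;                   /* invalid lead byte: skip one byte */
}

/* Decode one code point; input is assumed to be well-formed. */
static inline uint32_t
sketch_utf8_decode(const unsigned char *s)
{
    if ((*s & 0x80) == 0)
        return s[0];
    if ((*s & 0xe0) == 0xc0)
        return ((uint32_t) (s[0] & 0x1f) << 6) |
               (uint32_t) (s[1] & 0x3f);
    if ((*s & 0xf0) == 0xe0)
        return ((uint32_t) (s[0] & 0x0f) << 12) |
               ((uint32_t) (s[1] & 0x3f) << 6) |
               (uint32_t) (s[2] & 0x3f);
    if ((*s & 0xf8) == 0xf0)
        return ((uint32_t) (s[0] & 0x07) << 18) |
               ((uint32_t) (s[1] & 0x3f) << 12) |
               ((uint32_t) (s[2] & 0x3f) << 6) |
               (uint32_t) (s[3] & 0x3f);
    return 0xFFFFFFFF;          /* invalid lead byte */
}

/* Example hot loop: count code points in a NUL-terminated string. */
static size_t
sketch_count_code_points(const char *str)
{
    const unsigned char *s = (const unsigned char *) str;
    size_t n = 0;

    while (*s != '\0')
    {
        s += sketch_utf8_len(s);
        n++;
    }
    return n;
}
```

Whether a given compiler actually inlines these is up to the compiler; the keyword only makes the definition visible at each call site so that it can.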

Regards,
Jeff Davis

Attachment                                                       Content-Type  Size
v20-0001-Add-Unicode-property-tables.patch                       text/x-patch  121.6 KB
v20-0002-Add-unicode-case-mapping-tables-and-functions.patch     text/x-patch  317.3 KB
v20-0003-Catalog-changes-preparing-for-builtin-collation-.patch  text/x-patch  48.5 KB
v20-0004-Introduce-collation-provider-builtin.patch              text/x-patch  87.0 KB
v20-0005-Add-builtin-collation-objects-PG_C_UTF8-and-PG_U.patch  text/x-patch  10.8 KB
v20-0006-Inline-basic-UTF-8-functions.patch                      text/x-patch  6.2 KB
