Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-03-01 05:05:34
Message-ID: 3bc653b5d562ae9e2838b11cb696816c328a489a.camel@j-davis.com
Lists: pgsql-hackers

On Mon, 2024-02-26 at 19:01 -0800, Jeff Davis wrote:
>  * Right now you can't mix all of the full case mapping behavior with
> INITCAP(), it just does simple titlecase mapping. I'm not sure we want
> to get too fancy here; after all, INITCAP() is not a SQL standard
> function and it's documented in a narrow fashion that doesn't seem to
> leave a lot of room to be very smart. ICU does a few extra things
> beyond what I did:
>   - it accepts a word break iterator to the case conversion function
>   - it provides some built-in word break iterators
>   - it also has some configurable "break adjustment" behavior[1][2]
> which re-aligns the start of the word, and I'm not entirely sure why
> that isn't done in the word break iterator or the titlecasing rules

Attached is v19, which addresses this issue. It does proper Unicode
titlecasing and takes a word boundary iterator as an argument. For
initcap, it just uses a simple word boundary iterator that breaks
whenever the result of isalnum() changes from one character to the next.
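
To illustrate the idea (this is not the code from the patch, which is in
C), here is a minimal Python sketch of titlecasing driven by a word
boundary iterator. The names alnum_boundaries and titlecase are
hypothetical, and Python's str.isalnum(), str.title() and str.lower()
stand in for the patch's own property and case mapping tables, so the
results may differ in corner cases.

```python
from typing import Callable, Iterator


def alnum_boundaries(text: str) -> Iterator[int]:
    """Yield every index where isalnum() differs from the previous character.

    Hypothetical stand-in for the simple initcap word boundary iterator.
    The position before the string is treated as non-alphanumeric, and the
    end of the string is always reported as the last boundary.
    """
    prev = False
    for i, ch in enumerate(text):
        cur = ch.isalnum()
        if cur != prev:
            yield i
        prev = cur
    yield len(text)


def titlecase(
    text: str,
    boundaries: Callable[[str], Iterator[int]] = alnum_boundaries,
) -> str:
    """Titlecase the first character after each boundary, lowercase the rest."""
    out = []
    start = 0
    for end in boundaries(text):
        segment = text[start:end]
        if segment:
            # For a run of non-alphanumeric characters both mappings are
            # normally no-ops, so separators pass through unchanged.
            out.append(segment[0].title() + segment[1:].lower())
        start = end
    return "".join(out)


print(titlecase("hello wORLD"))  # Hello World
```

Because the iterator is an argument, a caller could pass a different
boundary function (for example one based on Unicode word segmentation)
without touching the case mapping logic.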

It came out cleaner this way, ultimately, and it feels more complete
even though the behavior isn't much different. It's also easier to
comment the relationship of the functions to Unicode. I removed
CaseKind from the public API but still use it internally to avoid code
duplication.

I made one other change, which is that (for now) I undid the UCS_BASIC
change until we are sure we want to change it. Instead, I have builtin
collations PG_C_UTF8 and PG_UNICODE_FAST. I used the name "FAST" to
indicate that the collation uses fast memcmp() rather than a real
collation, but the Unicode character support is all there (including
full case mapping). I'm open to suggestions on the naming.
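
As a rough illustration of that split between ordering and character
semantics (a Python sketch, not PostgreSQL code), comparing the UTF-8
bytes of two strings gives the same result memcmp() would, which is
plain code point order, while case conversion can still use full case
mapping. The helper name memcmp_sorted is hypothetical, and Python's
str.upper() is only assumed to approximate the mapping the collation
would apply.

```python
from typing import Iterable, List


def memcmp_sorted(values: Iterable[str]) -> List[str]:
    """Sort by raw UTF-8 bytes, i.e. the order a memcmp()-style compare gives."""
    return sorted(values, key=lambda s: s.encode("utf-8"))


# Byte order puts all uppercase ASCII before lowercase ASCII, and
# non-ASCII characters after both; no linguistic rules are applied.
print(memcmp_sorted(["b", "a", "B", "\u00e9"]))

# Full case mapping may change the length of the string:
# U+00DF (sharp s) uppercases to "SS".
print("stra\u00dfe".upper())
```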

Regards,
Jeff Davis

Attachment Content-Type Size
v19-0001-Documentation-update-for-Standard-Collations.patch text/x-patch 5.0 KB
v19-0002-Add-Unicode-property-tables.patch text/x-patch 105.9 KB
v19-0003-Add-unicode-case-mapping-tables-and-functions.patch text/x-patch 315.7 KB
v19-0004-Catalog-changes-preparing-for-builtin-collation-.patch text/x-patch 48.5 KB
v19-0005-Introduce-collation-provider-builtin.patch text/x-patch 86.5 KB
v19-0006-Add-builtin-collation-objects-PG_C_UTF8-and-PG_U.patch text/x-patch 10.7 KB
