Re: Built-in CTYPE provider

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Peter Eisentraut <peter(at)eisentraut(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Built-in CTYPE provider
Date:	2024-03-01 05:05:34
Message-ID:	3bc653b5d562ae9e2838b11cb696816c328a489a.camel@j-davis.com
Lists:	pgsql-hackers

On Mon, 2024-02-26 at 19:01 -0800, Jeff Davis wrote:
> * Right now you can't mix all of the full case mapping behavior with
> INITCAP(), it just does simple titlecase mapping. I'm not sure we want
> to get too fancy here; after all, INITCAP() is not a SQL standard
> function and it's documented in a narrow fashion that doesn't seem to
> leave a lot of room to be very smart. ICU does a few extra things
> beyond what I did:
>   - it accepts a word break iterator to the case conversion function
>   - it provides some built-in word break iterators
>   - it also has some configurable "break adjustment" behavior[1][2]
>     which re-aligns the start of the word, and I'm not entirely sure why
>     that isn't done in the word break iterator or the titlecasing rules

Attached is v19, which addresses this issue. It does proper Unicode
titlecasing with a word boundary iterator as an argument. For initcap,
it just uses a simple word boundary iterator that breaks whenever the
result of isalnum() changes from one character to the next.
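
To illustrate the shape of that design, here is a minimal ASCII-only
sketch, not the code from the patch. The names WordBoundaryFn,
alnum_boundary() and titlecase_ascii() are hypothetical, and the real
functions operate on Unicode code points with the generated case mapping
tables rather than on bytes with toupper()/tolower().

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical callback type: given the text and a starting offset,
 * return the offset of the next word boundary (len at end of input).
 */
typedef size_t (*WordBoundaryFn) (const char *str, size_t len, size_t pos);

/* Simple iterator: a boundary falls wherever isalnum() changes. */
static size_t
alnum_boundary(const char *str, size_t len, size_t pos)
{
	int			prev;

	if (pos >= len)
		return len;

	prev = isalnum((unsigned char) str[pos]) != 0;
	for (pos++; pos < len; pos++)
	{
		int			curr = isalnum((unsigned char) str[pos]) != 0;

		if (curr != prev)
			break;
	}
	return pos;
}

/*
 * ASCII-only stand-in for titlecasing: uppercase the first character of
 * each segment and lowercase the rest. dst must have room for
 * strlen(src) + 1 bytes.
 */
static void
titlecase_ascii(char *dst, const char *src, WordBoundaryFn next_boundary)
{
	size_t		len = strlen(src);
	size_t		start = 0;

	while (start < len)
	{
		size_t		end = next_boundary(src, len, start);
		size_t		i;

		dst[start] = (char) toupper((unsigned char) src[start]);
		for (i = start + 1; i < end; i++)
			dst[i] = (char) tolower((unsigned char) src[i]);
		start = end;
	}
	dst[len] = '\0';
}
```

Because the iterator is passed in as an argument, a caller could supply a
different boundary rule without touching the case mapping loop. Segments
made of non-alphanumeric characters pass through unchanged here, since
toupper() and tolower() leave them alone.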

It came out cleaner this way, ultimately, and it feels more complete
even though the behavior isn't much different. It's also easier to
comment the relationship of the functions to Unicode. I removed
CaseKind from the public API but still use it internally to avoid code
duplication.

I made one other change, which is that (for now) I undid the UCS_BASIC
change until we are sure we want to change it. Instead, I have builtin
collations PG_C_UTF8 and PG_UNICODE_FAST. I used the name "FAST" to
indicate that the collation uses fast memcmp() rather than a real
collation, but the Unicode character support is all there (including
full case mapping). I'm open to suggestions on naming.
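
For readers unfamiliar with what a memcmp()-based ordering looks like,
here is a minimal sketch. The function name bytewise_compare() is
hypothetical and this is not the code from the patch. With UTF-8 input,
comparing bytes this way yields code point order, which is cheap but is
not a linguistic ordering.

```c
#include <string.h>

/*
 * Compare two byte strings with memcmp(), breaking ties by length so
 * that a string sorts before any longer string it is a prefix of.
 * Returns a negative, zero, or positive value.
 */
static int
bytewise_compare(const char *a, size_t alen, const char *b, size_t blen)
{
	size_t		minlen = (alen < blen) ? alen : blen;
	int			cmp = memcmp(a, b, minlen);

	if (cmp != 0)
		return cmp;

	return (alen > blen) - (alen < blen);
}
```

One visible consequence is that every uppercase ASCII letter sorts before
every lowercase one, and any non-ASCII character sorts after all of
ASCII, whatever a language-aware collation would say.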

Regards,
Jeff Davis

Attachment	Content-Type	Size
v19-0001-Documentation-update-for-Standard-Collations.patch	text/x-patch	5.0 KB
v19-0002-Add-Unicode-property-tables.patch	text/x-patch	105.9 KB
v19-0003-Add-unicode-case-mapping-tables-and-functions.patch	text/x-patch	315.7 KB
v19-0004-Catalog-changes-preparing-for-builtin-collation-.patch	text/x-patch	48.5 KB
v19-0005-Introduce-collation-provider-builtin.patch	text/x-patch	86.5 KB
v19-0006-Add-builtin-collation-objects-PG_C_UTF8-and-PG_U.patch	text/x-patch	10.7 KB

In response to

Re: Built-in CTYPE provider at 2024-02-27 03:01:37 from Jeff Davis

Responses

Re: Built-in CTYPE provider at 2024-03-02 23:02:00 from Jeff Davis