Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-03-21 00:13:26
Message-ID: a21b515000f121244db2a5570d4292d19df23379.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 2024-03-19 at 13:41 +0100, Peter Eisentraut wrote:
> * v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch
>
> Looks ok.

Committed.

> * v25-0003-Inline-basic-UTF-8-functions.patch

Committed.

> * v25-0004-Use-version-for-builtin-collations.patch
>
> Not sure about the version format "1.0", which implies some sort of
> major/minor or component-based system.  I would just use "1".

The v26 patch was not quite complete, so I didn't commit it yet.
Attached v27-0001 and 0002.

0002 is necessary because otherwise lc_collate_is_c() short-circuits
the version check in pg_newlocale_from_collation(). With 0002, the code
is simpler and all paths go through pg_newlocale_from_collation(), and
the version check happens even when lc_collate_is_c().

But perhaps there was a reason the code was the way it was, so
submitting for review in case I missed something.

> 0005 and 0006 don't contain any test cases.  So I guess they are
> really
> only usable via 0007.  Is that understanding correct?

0005 is not a functional change, it's just a refactoring to use a
callback, which is preparation for 0007.

> Are there any test cases that illustrate the word boundary changes in
> patch 0005?  It might be useful to test those against Oracle as well.

The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
collation vs '123Abc' in PG_UNICODE_FAST.

The reason for the latter behavior is that the Unicode Default Case
Conversion algorithm for toTitlecase() advances to the next Cased
character before mapping to titlecase, and digits are not Cased. ICU
has a configurable adjustment, and defaults in a way that produces
'123abc'.

New rebased series attached.

Regards,
Jeff Davis

Attachment Content-Type Size
v27-0001-Use-version-for-builtin-collations.patch text/x-patch 2.7 KB
v27-0002-Simplify-lookup_collation_cache.patch text/x-patch 5.6 KB
v27-0003-Add-unicode_strtitle-for-Unicode-Default-Case-Co.patch text/x-patch 9.2 KB
v27-0004-Support-Unicode-full-case-mapping-and-conversion.patch text/x-patch 555.7 KB
v27-0005-Support-PG_UNICODE_FAST-locale-in-the-builtin-co.patch text/x-patch 21.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melih Mutlu 2024-03-21 00:24:48 Re: Flushing large data immediately in pqcomm
Previous Message Bharath Rupireddy 2024-03-20 23:49:05 Re: Introduce XID age and inactive timeout based replication slot invalidation