Re: Built-in CTYPE provider

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-03-27 15:53:33
Message-ID: 610d7f1b-c68c-4eb8-a03d-1515da304c58@manitou-mail.org
Lists: pgsql-hackers

Jeff Davis wrote:

> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.

Even aside from ICU, glibc and pg_c_utf8 behave differently for
codepoints in the decimal digit category outside of the US-ASCII
range '0'..'9':

select initcap(concat(chr(0xff11), 'a') collate "C.utf8"); -- glibc 2.35
 initcap
---------
 １a

select initcap(concat(chr(0xff11), 'a') collate "pg_c_utf8");
 initcap
---------
 １A

Both collations consider that chr(0xff11) is not a digit
(isdigit()=>false) but C.utf8 says that it's alpha, whereas pg_c_utf8
says it's neither digit nor alpha.

AFAIU this is why in the above initcap() call, pg_c_utf8 considers
that 'a' is the first alphanumeric, whereas C.utf8 considers that '1'
is the first alphanumeric, leading to different capitalizations.
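
To make that reasoning concrete, here is a minimal Python sketch of an initcap-style loop in which the alphanumeric test is a parameter. It is only a model, not the PostgreSQL code: the function names are hypothetical, Python's case mapping stands in for the provider's, and the two predicates merely encode the classifications observed above (U+FF11 counted as alphanumeric vs. only ASCII '0'..'9' counted as digits).

```python
def initcap(text, is_alnum):
    """Uppercase the first character of each alphanumeric run, lowercase the rest."""
    out = []
    in_word = False  # True while we are inside a run of alphanumerics
    for ch in text:
        if is_alnum(ch):
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


def alnum_fullwidth_counts(ch):
    # Hypothetical model of the C.utf8 observation: U+FF11 is alphanumeric.
    return ch.isalpha() or ch.isdigit()


def alnum_ascii_digits_only(ch):
    # Hypothetical model of the pg_c_utf8 observation: only '0'..'9' are digits.
    return ch.isalpha() or ch in "0123456789"


s = "\uff11a"  # same value as concat(chr(0xff11), 'a')
print(initcap(s, alnum_fullwidth_counts))   # １a
print(initcap(s, alnum_ascii_digits_only))  # １A
```

With the second predicate, '123abc' still comes out as '123abc', because ASCII digits do start the word in that model.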

Comparing the 3 providers:

WITH v(provider,type,result) AS (values
('ICU', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "unicode"),
('glibc', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "C.utf8"),
('builtin', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "pg_c_utf8"),
('ICU', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "unicode"),
('glibc', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "C.utf8"),
('builtin', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "pg_c_utf8")
)
select * from v
\crosstabview

 provider | isalpha | isdigit
----------+---------+---------
 ICU      | f       | t
 glibc    | t       | f
 builtin  | f       | f

Are we fine with pg_c_utf8 differing from both ICU's point of view
(U+FF11 is a digit and not alpha) and glibc's point of view (U+FF11 is
not a digit, but it's alpha)?
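
As an outside reference point, the Unicode character data for this codepoint can be inspected with Python's standard library. This is only a rough cross-check under the assumption that Python's unicodedata and re modules are an acceptable stand-in; it says nothing about how any PostgreSQL provider is implemented. The re.ASCII flag is used here to imitate a digit class restricted to '0'..'9'.

```python
import re
import unicodedata

ch = "\uff11"  # the codepoint used in the queries above

print(unicodedata.name(ch))      # FULLWIDTH DIGIT ONE
print(unicodedata.category(ch))  # Nd (decimal digit)
print(ch.isalpha())              # False

# A digit class following Unicode matches it; an ASCII-only one does not.
unicode_digit = re.fullmatch(r"\d", ch) is not None
ascii_digit = re.fullmatch(r"\d", ch, re.ASCII) is not None
print(unicode_digit, ascii_digit)  # True False
```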

Aside from initcap(), this is going to be significant for regular
expressions.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
