Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-03-27 17:40:19
Message-ID: 2f404017690b43e6951cd4a60798c3f9626bbe56.camel@j-davis.com
Lists: pgsql-hackers

On Wed, 2024-03-27 at 16:53 +0100, Daniel Verite wrote:
>  provider | isalpha | isdigit
> ----------+---------+---------
>  ICU      | f       | t
>  glibc    | t       | f
>  builtin  | f       | f

The "ICU" above is really the behavior of the Postgres ICU provider as
we implemented it; it's not something forced on us by ICU.

For the ICU provider, pg_wc_isalpha() is defined as u_isalpha()[1] and
pg_wc_isdigit() is defined as u_isdigit()[2]. Those, in turn, are
defined by ICU to be equivalent to java.lang.Character.isLetter() and
java.lang.Character.isDigit().

ICU documents[3] how regex character classes should be implemented
using the ICU APIs, and cites Unicode TR#18 [4] as the source. Despite
being under the heading "...for C/POSIX character classes...", [3] says
it's based on the "Standard" variant of [4], rather than "POSIX
Compatible".

(Aside: the Postgres ICU provider doesn't match what [3] suggests for
the "alpha" class. For the character U+FF11 it doesn't matter, but I
suspect there are differences for other characters. This should be
fixed.)

The differences between PG_C_UTF8 and what ICU suggests are just
because the former uses the "POSIX Compatible" definitions and the
latter uses "Standard".
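
To illustrate the "digit" difference only, here is a minimal sketch in
Python (not the Postgres C code). The function names are hypothetical,
and the definitions are my reading of the TR#18 compatibility
properties table: "Standard" digit is \p{gc=Decimal_Number}, while
"POSIX Compatible" digit is just [0-9].

```python
import unicodedata


def isdigit_standard(ch: str) -> bool:
    # Assumed TR#18 "Standard" definition: general category Nd
    # (Decimal_Number), which includes fullwidth digits.
    return unicodedata.category(ch) == "Nd"


def isdigit_posix_compatible(ch: str) -> bool:
    # Assumed TR#18 "POSIX Compatible" definition: ASCII 0-9 only.
    return "0" <= ch <= "9"


# Compare an ASCII digit with U+FF11 (FULLWIDTH DIGIT ONE).
for ch in ("7", "\uff11"):
    print(f"U+{ord(ch):04X}",
          isdigit_standard(ch),
          isdigit_posix_compatible(ch))
```

As far as I can tell, both variants define "alpha" as \p{Alphabetic},
which would explain why only the isdigit column differs for U+FF11.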

I implemented both the "Standard" and "POSIX Compatible" compatibility
properties in ad49994538, so it would be easy to change what PG_C_UTF8
uses.

[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#aecff8611dfb1814d1770350378b3b283
[2]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a42b37828d86daa0fed18b381130ce1e6
[3]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details
[4]
http://www.unicode.org/reports/tr18/#Compatibility_Properties

> Are we fine with pg_c_utf8 differing from both ICU's point of view
> (U+ff11 is digit and not alpha) and glibc point of view (U+ff11 is
> not digit, but it's alpha)?

Yes, some differences are to be expected.

But I'm fine making a change to PG_C_UTF8 if it makes sense, as long as
we can point to something other than "glibc version 2.35 does it this
way".

Regards,
Jeff Davis
