Re: Built-in CTYPE provider

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-01-15 14:30:16
Message-ID: ae044bfb-682b-449f-ad4a-c46e4332ee48@manitou-mail.org
Lists: pgsql-hackers

Jeff Davis wrote:

> New version attached.

[v16]

Concerning the target category_test, it produces failures with
versions of ICU with Unicode < 15. The first one I see with Ubuntu
22.04 (ICU 70.1) is:

category_test: Postgres Unicode version: 15.1
category_test: ICU Unicode version: 14.0
category_test: FAILURE for codepoint 0x000c04
category_test: Postgres property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 1/0/0/0/0/0
category_test: ICU property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 0/0/0/0/0/0

U+0C04 is a codepoint added in Unicode 11.
https://en.wikipedia.org/wiki/Telugu_(Unicode_block)

In UnicodeData.txt:
0C04;TELUGU SIGN COMBINING ANUSVARA ABOVE;Mn;0;NSM;;;;;N;;;;;

In Unicode 15, it has been assigned Other_Alphabetic in PropList.txt:
$ grep 0C04 PropList.txt
0C04 ; Other_Alphabetic # Mn TELUGU SIGN COMBINING ANUSVARA ABOVE

But in Unicode 14 it was not listed there.
As a result, its binary property UCHAR_ALPHABETIC changed from
false to true in ICU 72 (Unicode 15) compared with previous versions.
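
To spell out the mechanism: Alphabetic is a derived property, defined in
DerivedCoreProperties.txt as the union of Lowercase, Uppercase, the
general categories Lt, Lm, Lo and Nl, and Other_Alphabetic. A minimal
Python sketch of that derivation follows. The function names are
hypothetical and the inputs are simplified (Other_Lowercase and
Other_Uppercase are ignored); it is not code from the patch or from ICU.

```python
# General categories that imply Alphabetic on their own. Lu and Ll stand in
# for the Uppercase and Lowercase properties; Other_Uppercase and
# Other_Lowercase are left out to keep the sketch short (an assumption).
ALPHABETIC_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def general_category(unicodedata_line):
    """Return the General_Category (third field) of a UnicodeData.txt line."""
    return unicodedata_line.split(";")[2]


def is_alphabetic(category, other_alphabetic):
    """Derive Alphabetic from the category and the Other_Alphabetic flag."""
    return category in ALPHABETIC_CATEGORIES or other_alphabetic


line = "0C04;TELUGU SIGN COMBINING ANUSVARA ABOVE;Mn;0;NSM;;;;;N;;;;;"
gc = general_category(line)

# Unicode 14: U+0C04 is not listed under Other_Alphabetic in PropList.txt.
alphabetic_in_14 = is_alphabetic(gc, other_alphabetic=False)
# Unicode 15: PropList.txt lists it, so the derived property flips.
alphabetic_in_15 = is_alphabetic(gc, other_alphabetic=True)
```

Because the general category stays Mn in both versions, the PropList.txt
entry alone accounts for the difference.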

As I understand it, the stability policy allows for such changes.
From https://www.unicode.org/policies/stability_policy.html:

Once a character is encoded, its properties may still be changed,
but not in such a way as to change the fundamental identity of the
character.

The Consortium will endeavor to keep the values of the other
properties as stable as possible, but some circumstances may arise
that require changing them. Particularly in the situation where
the Unicode Standard first encodes less well-documented characters
and scripts, the exact character properties and behavior initially
may not be well known.

As more experience is gathered in implementing the characters,
adjustments in the properties may become necessary. Examples of
such properties include, but are not limited to, the following:

- General_Category
- Case mappings
- Bidirectional properties
[...]

I've commented out the exit(1) in category_test to collect all errors, and
built it with versions of ICU from 74 down to 60 (that is, Unicode 10.0).
Results are attached. As expected, the older the ICU version, the more
differences are found against Unicode 15.1.
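
For anyone who wants to tally the differences per ICU version, here is a
rough Python sketch that extracts the failing codepoints from a log. It
assumes the log lines look like the excerpt quoted earlier in this
message; the regular expression and the function name are hypothetical,
and the sample text is just that excerpt, not the attached results.

```python
import re

# Matches lines such as "category_test: FAILURE for codepoint 0x000c04".
FAILURE_RE = re.compile(
    r"^category_test: FAILURE for codepoint 0x([0-9a-fA-F]+)$"
)


def failing_codepoints(log_text):
    """Return the codepoints reported as FAILURE, in order of appearance."""
    found = []
    for line in log_text.splitlines():
        match = FAILURE_RE.match(line.strip())
        if match:
            found.append(int(match.group(1), 16))
    return found


sample_log = """\
category_test: Postgres Unicode version: 15.1
category_test: ICU Unicode version: 14.0
category_test: FAILURE for codepoint 0x000c04
"""
codepoints = failing_codepoints(sample_log)
```

Counting the length of the returned list for each log would give one
number per ICU version to compare.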

I find these results interesting because they tell us what content
can break regexp-based check constraints on upgrades.

As a pass-or-fail test, however, category_test can only be used when
the Unicode version in ICU is the same as in Postgres.
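
One way to express that condition: treat a property difference as fatal
only when both sides report the same Unicode version, and as
informational otherwise. A minimal Python sketch of such a gate, using
version strings in the form printed by the test; the function names are
hypothetical and this is not how category_test is written today.

```python
def unicode_major_minor(version):
    """Parse a version string such as '15.1' into a (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def mismatch_is_failure(pg_version, icu_version):
    """Differences count as hard failures only if the versions match."""
    return unicode_major_minor(pg_version) == unicode_major_minor(icu_version)


# Ubuntu 22.04 case from this message: the versions differ, so the
# differences would be reported but not treated as a test failure.
strict_on_ubuntu_2204 = mismatch_is_failure("15.1", "14.0")
```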

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachment Content-Type Size
results-category-tests-multiple-icu-versions.tar.bz2 application/octet-stream 2.1 KB
