Re: Pre-proposal: unicode normalized text

From: Peter Eisentraut <peter(at)eisentraut(dot)org>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-06 07:58:37
Message-ID: c3efcb28-5285-d668-a835-84d5ffb73721@eisentraut.org
Lists: pgsql-hackers

On 03.10.23 21:54, Jeff Davis wrote:
>> Here, Jeff mentions normalization, but I think it's a major issue with
>> collation support. If new code points are added, users can put them
>> into the database before they are known to the collation library, and
>> then when they become known to the collation library the sort order
>> changes and indexes break.
>
> The collation version number may reflect the change in understanding
> about assigned code points that may affect collation -- though I'd like
> to understand whether this is guaranteed or not.

This is correct. The collation version number produced by ICU contains
the UCA version, which is effectively the Unicode version (14.0, 15.0,
etc.). Since new code point assignments can only come from new Unicode
versions, a new assigned code point will always result in a different
collation version.

For example, with ICU 70 / CLDR 40 / Unicode 14:

select collversion from pg_collation where collname = 'unicode';
= 153.112

With ICU 72 / CLDR 42 / Unicode 15, the same query returns:

= 153.120
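
The two values above suggest how the Unicode version can be read back out of the collation version. The following is only a sketch: it assumes the second component packs the UCA version as (major << 3) + minor, which fits both values shown (112 = 14 * 8, 120 = 15 * 8) but is not a documented interface. The column names are made up for illustration.

```sql
-- Sketch only: assumes the second component of an ICU collation version
-- encodes the UCA version as (major << 3) + minor. This matches the two
-- values quoted above but is not a documented, stable format.
select v as collversion,
       split_part(v, '.', 2)::int >> 3 as uca_major,  -- upper bits
       split_part(v, '.', 2)::int & 7  as uca_minor   -- low three bits
from (values ('153.112'), ('153.120')) as t(v);
```

Under that assumption, the first row decodes to 14.0 and the second to 15.0, in line with the ICU / Unicode pairings given above.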

> At minimum I think we need to have some internal functions to check for
> unassigned code points. That belongs in core, because we generate the
> unicode tables from a specific version.

If you want to be rigid about it, you also need to consider whether the
Unicode version used by the ICU library in use matches the one used by
the in-core tables.
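
A minimal way to make that comparison, assuming PostgreSQL 17 or later, where the functions unicode_version() and icu_unicode_version() are available (they did not exist in released versions when this message was written):

```sql
-- Assumes PostgreSQL 17 or later.
-- unicode_version() reports the Unicode version of the in-core tables;
-- icu_unicode_version() reports the one used by the linked ICU library
-- and returns NULL if the server was built without ICU, in which case
-- versions_match is NULL as well.
select unicode_version()     as core_unicode,
       icu_unicode_version() as icu_unicode,
       unicode_version() = icu_unicode_version() as versions_match;
```

A false result would mean that a code point can be assigned according to one source and unassigned according to the other.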
