Pre-proposal: unicode normalized text

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Pre-proposal: unicode normalized text
Date: 2023-09-12 22:47:10
Message-ID: f30b58657ceb71d5be032decf4058d454cc1df74.camel@j-davis.com
Lists: pgsql-hackers


One of the frustrations with using the "C" locale (or any deterministic
locale) is that the following returns false:

SELECT 'á' = 'á'; -- false

because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
respectively, so memcmp() returns non-zero. But it's really the same
character with just a different representation, and if you normalize
them they are equal:

SELECT normalize('á') = normalize('á'); -- true

The idea is to have a new data type, say "UTEXT", that normalizes the
input so that it can have an improved notion of equality while still
using memcmp().
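
To make the idea concrete, here is a rough sketch in Python rather than PostgreSQL C. The names utext_in and utext_eq are hypothetical, and NFC is only an assumption about which normal form the type would use. It is meant as an illustration of "normalize once on input, then compare bytes", not a proposed implementation.

```python
import unicodedata


def utext_in(value: str) -> bytes:
    """Hypothetical input function: normalize to NFC, store as UTF-8 bytes."""
    return unicodedata.normalize("NFC", value).encode("utf-8")


def utext_eq(a: bytes, b: bytes) -> bool:
    """Equality is a plain byte comparison, analogous to memcmp() == 0."""
    return a == b


decomposed = "\u0061\u0301"  # 'a' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e1"       # LATIN SMALL LETTER A WITH ACUTE

# The raw UTF-8 encodings differ, which is what TEXT sees under "C".
raw_equal = decomposed.encode("utf-8") == precomposed.encode("utf-8")

# After normalizing on input, the stored bytes are identical.
stored_equal = utext_eq(utext_in(decomposed), utext_in(precomposed))

print(raw_equal, stored_equal)
```

The normalization cost is paid once at input time, so comparisons stay as cheap as they are for TEXT.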

Unicode guarantees that "the results of normalizing a string on one
version will always be the same as normalizing it on any other version,
as long as the string contains only assigned characters according to
both versions"[1]. It also guarantees that it "will not reallocate,
remove, or reassign" characters[2]. That means that we can normalize in
a forward-compatible way as long as we don't allow the use of
unassigned code points.
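
A minimal sketch of that constraint, again in Python for illustration only. It assumes that "unassigned" can be approximated by general category Cn in whatever Unicode version the Python build bundles (Cn also covers noncharacters). The function names are hypothetical.

```python
import unicodedata


def has_unassigned(value: str) -> bool:
    """True if any code point has general category Cn in the bundled
    Unicode version. Cn covers reserved code points and noncharacters."""
    return any(unicodedata.category(ch) == "Cn" for ch in value)


def utext_check(value: str) -> str:
    """Hypothetical input check: reject unassigned code points first,
    then normalize (NFC is an assumption, not something decided here)."""
    if has_unassigned(value):
        raise ValueError("input contains an unassigned code point")
    return unicodedata.normalize("NFC", value)


print(has_unassigned("a\u0301"))  # both code points are assigned
print(has_unassigned("\u0378"))   # U+0378 is a reserved, unassigned code point
```

Rejecting the input up front is what lets the stored, normalized value stay valid when the server later moves to a newer Unicode version.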

I looked at the standard to see what it had to say, and it discusses
normalization, but a standard UCS string with an unassigned code point
is not an error. Without a data type to enforce the constraint that
there are no unassigned code points, we can't guarantee forward
compatibility. Some other systems support NVARCHAR, but I didn't see
any guarantee of normalization or blocking unassigned code points
there, either.

UTEXT benefits:
* slightly better natural language semantics than TEXT with
deterministic collation
* still deterministic=true
* fast memcmp()-based comparisons
* no breaking semantic changes as unicode evolves

TEXT allows unassigned code points, and generally returns the same byte
sequences that were originally entered; therefore UTEXT is not a
replacement for TEXT.

UTEXT could be built-in or it could be an extension or in contrib. If
an extension, we'd probably want to at least expose a function that can
detect unassigned code points, so that it's easy to be consistent with
the auto-generated unicode tables. I also notice that there already is
an unassigned code points table in saslprep.c, but it seems to be
frozen as of Unicode 3.2, and I'm not sure why.
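
To show why the version of the table matters, here is a small Python illustration. Python happens to ship a Unicode 3.2 snapshot as unicodedata.ucd_3_2_0 alongside its current tables, so the same code point can be checked against both. This is not PostgreSQL code, and the helper name is hypothetical.

```python
import unicodedata


def is_assigned(ch: str, db=unicodedata) -> bool:
    """True if the code point is not in general category Cn according to
    the given database (the current tables by default)."""
    return db.category(ch) != "Cn"


grinning_face = "\U0001F600"  # added to Unicode well after version 3.2

assigned_now = is_assigned(grinning_face)
assigned_in_3_2 = is_assigned(grinning_face, unicodedata.ucd_3_2_0)

print(unicodedata.unidata_version, assigned_now, assigned_in_3_2)
```

A check driven by a frozen table would reject many characters that the auto-generated tables accept, so the two would disagree about what a valid value is.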

Questions:

* Would this be useful enough to justify a new data type? Would it be
confusing to decide when to choose one versus the other?
* Would cross-type comparisons between TEXT and UTEXT become a major
problem that would reduce the utility?
* Should "some_utext_value = some_text_value" coerce the LHS to TEXT
or the RHS to UTEXT?
* Other comments or am I missing something?
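
To illustrate why the coercion direction in the third question matters, here is a Python sketch. The helpers as_text and as_utext are hypothetical and NFC is assumed. The same pair of values compares differently depending on which side is coerced.

```python
import unicodedata


def as_text(value: str) -> bytes:
    """TEXT keeps the byte sequence as entered."""
    return value.encode("utf-8")


def as_utext(value: str) -> bytes:
    """Hypothetical UTEXT: normalized to NFC on input (an assumption)."""
    return unicodedata.normalize("NFC", value).encode("utf-8")


some_utext_value = as_utext("\u00e1")  # stored normalized
text_as_entered = "\u0061\u0301"       # decomposed form, stored as-is in TEXT

# Option 1: coerce the UTEXT side to TEXT and compare raw bytes.
lhs_to_text = some_utext_value == as_text(text_as_entered)

# Option 2: coerce the TEXT side to UTEXT, normalizing it first.
rhs_to_utext = some_utext_value == as_utext(text_as_entered)

print(lhs_to_text, rhs_to_utext)
```

Option 2 also has to decide what to do when the TEXT value contains an unassigned code point, which this sketch does not handle.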

Regards,
Jeff Davis

[1] https://unicode.org/reports/tr15/
[2] https://www.unicode.org/policies/stability_policy.html
