Re: Pre-proposal: unicode normalized text

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-11 01:08:41
Message-ID: 2bab90239c5264fa9a87372c16bbf8759c8f9e64.camel@j-davis.com
Lists: pgsql-hackers

On Tue, 2023-10-10 at 10:02 -0400, Robert Haas wrote:
> On Tue, Oct 10, 2023 at 2:44 AM Peter Eisentraut
> <peter(at)eisentraut(dot)org> wrote:
> > Can you restate what this is supposed to be for?  This thread
> > appears to
> > have morphed from "let's normalize everything" to "let's check for
> > unassigned code points", but I'm not sure what we are aiming for
> > now.

It was a "pre-proposal", so yes, the goalposts have moved a bit. Right
now I'm aiming to get some primitives in place that will be useful by
themselves, but also that we can potentially build on.

Attached is a new version of the patch which introduces some SQL
functions as well:

* unicode_is_valid(text): returns true if all code points are
  assigned, false otherwise
* unicode_version(): version of Unicode that Postgres is built with
* icu_unicode_version(): version of Unicode that ICU is built with

I'm not 100% clear on the consequences of differences between the PG
Unicode version and the ICU Unicode version, but because normalization
uses the Postgres version of Unicode, I believe the Postgres version of
Unicode should also be available to determine whether a code point is
assigned or not.
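
As a rough illustration of the "all code points assigned" check, here is a
minimal sketch outside the server. It is not the patch's C implementation:
it uses Python's standard unicodedata module, so "assigned" is judged
against the Unicode version bundled with the Python build, and the names
all_assigned() and unicode_version() are hypothetical stand-ins for the SQL
functions listed above. The sketch treats only general category Cn as
unassigned; private-use and surrogate code points pass.

```python
import unicodedata


def all_assigned(s: str) -> bool:
    """Return True if no code point in s has general category Cn.

    Cn covers unassigned (reserved) code points and noncharacters,
    according to the Unicode tables shipped with this Python build.
    """
    return all(unicodedata.category(ch) != "Cn" for ch in s)


def unicode_version() -> str:
    """Return the Unicode version of the tables used by all_assigned()."""
    return unicodedata.unidata_version


if __name__ == "__main__":
    print(unicode_version())
    print(all_assigned("abc"))        # only assigned code points
    print(all_assigned("a\u0378b"))   # U+0378 is reserved in the Greek block
```

The answer depends on which Unicode version the tables come from, which is
why exposing that version next to the check is useful.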

We may also find it interesting to use the PG Unicode tables for regex
character classification. This is just an idea and we can discuss
whether that makes sense or not, but having the primitives in place
seems like a good idea regardless.

> Jeff can say what he wants it for, but one obvious application would
> be to have the ability to add a CHECK constraint that forbids
> inserting unassigned code points into your database, which would be
> useful if you're worried about forward-compatibility with collation
> definitions that might be extended to cover those code points in the
> future. Another application would be to find data already in your
> database that has this potential problem.

Exactly. Avoiding unassigned code points also allows you to be
forward-compatible with normalization.
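
For the "find data already in your database" case, a small sketch of what
such a scan might report, again using Python's unicodedata module rather
than the proposed SQL functions. The helper name unassigned_code_points()
is hypothetical, and the result depends on the Unicode version of the
Python build.

```python
import unicodedata


def unassigned_code_points(s: str) -> list[str]:
    """List the code points in s whose general category is Cn.

    Each hit is formatted as U+XXXX, in order of appearance,
    with repeats kept.
    """
    return [
        f"U+{ord(ch):04X}"
        for ch in s
        if unicodedata.category(ch) == "Cn"
    ]


if __name__ == "__main__":
    # Hypothetical sample values standing in for rows of a text column.
    rows = ["plain ascii", "caf\u00e9", "bad\u0378value"]
    for row in rows:
        hits = unassigned_code_points(row)
        if hits:
            print(repr(row), hits)
```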

Regards,
Jeff Davis

Attachment: v2-0001-Additional-unicode-primitive-functions.patch
(text/x-patch, 214.1 KB)
