Re: Pre-proposal: unicode normalized text

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2024-03-01 01:02:51
Message-ID: a0e85aca6e03042881924c4b31a840a915a9d349.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.

Attached is an implementation of a per-database option STRICT_UNICODE
which enforces the use of assigned code points only.

Not everyone would want to use it. There are lots of applications that
accept free-form text, and that may include recently-assigned code
points not yet recognized by Postgres.

But it would offer protection/stability for some databases. It makes it
possible to have a hard guarantee that Unicode normalization is
stable[1]. And it may also mitigate the risk of collation changes --
using unassigned code points carries a high risk that the collation
order changes as soon as the collation provider recognizes the
assignment. (Though assigned code points can change, too, so limiting
yourself to assigned code points is only a mitigation.)

I worry slightly that users will think at first that they want only
assigned code points, and then later figure out that the application
has increased in scope and now takes all kinds of free-form text. In
that case, the user can "ALTER DATABASE ... STRICT_UNICODE FALSE", and
follow up with some "CHECK (unicode_assigned(...))" constraints on the
particular fields that they'd like to protect.

There's some weirdness that the set of assigned code points as Postgres
sees it may not match what a collation provider sees due to differing
Unicode versions. That's not great -- perhaps we could check that code
points are considered assigned by *both* Postgres and ICU. I don't know
if there's a way to tell if libc considers a code point to be assigned.

Regards,
Jeff Davis

[1]
https://www.unicode.org/policies/stability_policy.html#Normalization

Attachment Content-Type Size
v1-0001-CREATE-DATABASE-.-STRICT_UNICODE.patch text/x-patch 31.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jacob Champion 2024-03-01 01:08:01 Re: [PoC] Federated Authn/z with OAUTHBEARER
Previous Message Melanie Plageman 2024-03-01 00:29:45 Re: BitmapHeapScan streaming read user and prelim refactoring