Re: Pre-proposal: unicode normalized text

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-04 20:15:03
Message-ID: e6bc7d2b5eb169b986d432f2177c995fd6c02748.camel@j-davis.com
Lists: pgsql-hackers

On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote:
> any byte sequence at all is accepted when you try to
> put values into the database.

We support SQL_ASCII, which allows something similar.

> At any rate, if we were to go in the direction of rejecting code
> points that aren't yet assigned, or aren't yet known to the collation
> library, that's another way for data loading to fail.

A failure during data loading is either a feature or a bug, depending
on whether you are the one loading the data or the one trying to make
sense of it later ;-)

> Which feels like
> very defensible behavior, but not what everyone wants, or is used to.

Yeah, there are many reasons someone might want to accept unassigned
code points. An obvious one is if their application is on a newer
version of Unicode where the code point *is* assigned.
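
To make the version dependence concrete, here is a rough sketch in Python (not PostgreSQL code) of what such a check could look like. The function name has_unassigned is made up for illustration, and the answer depends entirely on which Unicode version the checking library was built against, which is exactly the problem described above.

```python
import unicodedata

def has_unassigned(s: str) -> bool:
    """Return True if any code point in s has general category Cn.

    Hypothetical helper for illustration only. "Cn" means unassigned
    (or a noncharacter) according to the Unicode tables bundled with
    this Python build, see unicodedata.unidata_version. A newer
    Unicode version may assign a code point that is rejected here.
    """
    return any(unicodedata.category(ch) == "Cn" for ch in s)

# Unicode version this check is based on; it varies by Python release.
checked_against = unicodedata.unidata_version
```

An application built against a newer Unicode version than the server would see some of its valid strings fail a check like this.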

>
> The fact that there are multiple types of normalization and multiple
> notions of equality doesn't make this easier.

NFC is really the only one that makes sense.

NFD is semantically the same as NFC, but expanded into a larger
representation. NFKC/NFKD are based on a more relaxed notion of
equality -- kind of like non-deterministic collations. These other
forms might make sense in certain cases, but not general use.
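
As a small illustration of the difference between the forms, here is a sketch using Python's standard unicodedata module (chosen only for convenience; it is not how PostgreSQL implements normalization). The helper name all_forms is an assumption for this example.

```python
import unicodedata

def all_forms(s: str) -> dict:
    """Return s normalized under each of the four Unicode forms."""
    return {form: unicodedata.normalize(form, s)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}

# "e" followed by U+0301 COMBINING ACUTE ACCENT: NFC composes it into
# the single code point U+00E9, NFD keeps (or restores) the two-code-point
# expanded representation. Same meaning, different length.
accented = all_forms("e\u0301")

# U+FB01 LATIN SMALL LIGATURE FI: NFC leaves it alone, while NFKC
# replaces it with the two letters "fi". That is the more relaxed
# notion of equality.
ligature = all_forms("\ufb01")
```

Here accented["NFC"] is one code point and accented["NFD"] is two, while ligature["NFC"] is unchanged and ligature["NFKC"] is the plain string "fi".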

I believe that having a kind of text data type where it's stored in NFC
and compared with memcmp() would be a good place for many users to be --
probably most users. It's got all the performance and stability
benefits of memcmp(), with slightly richer semantics. It's less likely
that someone malicious can confuse the database by using different
representations of the same character.
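
A minimal sketch of the idea, again in Python rather than C, and assuming UTF-8 storage: normalize once on input, then compare the stored bytes directly. The names store_nfc and bytes_equal are hypothetical, and bytewise == on bytes objects stands in for memcmp().

```python
import unicodedata

def store_nfc(s: str) -> bytes:
    """Normalize to NFC once, at input time, and return UTF-8 bytes.

    Stand-in for what a hypothetical NFC text type would keep on disk.
    """
    return unicodedata.normalize("NFC", s).encode("utf-8")

def bytes_equal(a: bytes, b: bytes) -> bool:
    """Plain bytewise equality, playing the role of memcmp() == 0."""
    return a == b

composed = "caf\u00e9"     # precomposed U+00E9
decomposed = "cafe\u0301"  # "e" plus U+0301 COMBINING ACUTE ACCENT

# Without normalization the two spellings are different byte strings.
raw_match = bytes_equal(composed.encode("utf-8"), decomposed.encode("utf-8"))

# After normalizing on input, a bytewise comparison treats them as equal.
nfc_match = bytes_equal(store_nfc(composed), store_nfc(decomposed))
```

With this input raw_match is False and nfc_match is True, so the comparison itself stays a simple byte comparison and the cost of normalization is paid once when the value is stored.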

The problem is that it's not universally better for everyone: there are
certainly users who would prefer that the codepoints they send to the
database are preserved exactly, and also users who would like to be
able to use unassigned code points.

Regards,
Jeff Davis
