Re: Pre-proposal: unicode normalized text

From: Nico Williams <nico(at)cryptonector(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-11-02 23:23:19
Message-ID: ZUQvZ2HQIQqG3U8Z@ubby21
Lists: pgsql-hackers

On Wed, Oct 04, 2023 at 01:15:03PM -0700, Jeff Davis wrote:
> > The fact that there are multiple types of normalization and multiple
> > notions of equality doesn't make this easier.

And then there's text that isn't normalized to any of them.

> NFC is really the only one that makes sense.

Yes.

Most input modes produce NFC, though there may be scripts (like Hangul)
where input modes might produce NFD, so I wouldn't say NFC is universal.

Unfortunately HFS+ uses NFD so NFD can leak into places naturally enough
through OS X.
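
To make the NFC/NFD difference concrete, here is a small Python sketch using the standard-library unicodedata module. The sample strings (an accented Latin letter and one Hangul syllable) are chosen only for illustration; nothing here is specific to PostgreSQL or HFS+.

```python
import unicodedata

# The same visible character "é" in two forms.
composed = "\u00e9"      # NFC: one precomposed code point
decomposed = "e\u0301"   # NFD: base letter + COMBINING ACUTE ACCENT


def utf8(s: str) -> bytes:
    """Return the UTF-8 bytes that a memcmp()-style comparison would see."""
    return s.encode("utf-8")


# Byte-wise, the two spellings differ...
assert utf8(composed) != utf8(decomposed)

# ...but each normalizes to the other form.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# A precomposed Hangul syllable decomposes into three conjoining jamo in NFD.
hangul = "\ud55c"  # U+D55C
jamo = unicodedata.normalize("NFD", hangul)
assert [hex(ord(c)) for c in jamo] == ["0x1112", "0x1161", "0x11ab"]
assert unicodedata.normalize("NFC", jamo) == hangul
```

A file name created in one form and looked up in the other would therefore miss under a plain byte comparison, which is how NFD text can end up somewhere that expects NFC.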

> I believe that having a kind of text data type where it's stored in NFC
> and compared with memcmp() would be a good place for many users to be -
> - probably most users. It's got all the performance and stability
> benefits of memcmp(), with slightly richer semantics. It's less likely
> that someone malicious can confuse the database by using different
> representations of the same character.
>
> The problem is that it's not universally better for everyone: there are
> certainly users who would prefer that the codepoints they send to the
> database are preserved exactly, and also users who would like to be
> able to use unassigned code points.

The alternative is form insensitivity, where you compare strings as
equal even if they aren't memcmp()-equal, as long as they are equal when
normalized. This can be made fast, though not as fast as memcmp().
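
A minimal sketch of such a comparison, in Python with the standard-library unicodedata module (unicodedata.is_normalized requires Python 3.8 or later). The function name form_insensitive_eq and the choice of NFC as the comparison form are assumptions made for illustration; this is not how ZFS or any database implements it.

```python
import unicodedata


def form_insensitive_eq(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: equal if a and b match after normalization."""
    # Fast path: identical code point sequences are equal in any form.
    # This is the memcmp()-like case and should cover most comparisons.
    if a == b:
        return True
    # If both inputs are already normalized and still differ, they are
    # genuinely different strings; no normalization work is needed.
    if unicodedata.is_normalized(form, a) and unicodedata.is_normalized(form, b):
        return False
    # Slow path: normalize both sides, then compare.
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


assert form_insensitive_eq("caf\u00e9", "cafe\u0301")    # NFC vs NFD spelling
assert not form_insensitive_eq("cafe", "caf\u00e9")      # different text
```

The extra cost over memcmp() is confined to the case where the inputs differ byte-wise and at least one of them is not already normalized.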

The problem with form insensitivity is that you might have to implement
it in numerous places. In ZFS there are only a few, but in a database
every index type, for example, will need to hook in form insensitivity.
If so, then that complexity would be a good argument to just normalize.

Nico
--
