Re: Pre-proposal: unicode normalized text

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-04 17:16:22
Message-ID: CA+TgmoYzYR-yhU6k1XFCADeyj=Oyz2PkVsa3iKv+keM8wp-F_A@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 3, 2023 at 3:54 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> I assume you mean because we reject invalid byte sequences? Yeah, I'm
> sure that causes a problem for some (especially migrations), but it's
> difficult for me to imagine a database working well with no rules at
> all for the basic data types.

There's a very popular commercial database where, or so I have been
led to believe, any byte sequence at all is accepted when you try to
put values into the database. The rumors I've heard -- I have not
played with it myself -- are that when you try to do anything, byte
sequences that are not valid in the configured encoding are treated as
single-byte characters or something of that sort. So, for example, if
you had UTF-8 as the encoding and the first byte of the string were
something that can only appear as a continuation byte in UTF-8, I
think that byte would just be treated as a separate character. I don't
quite know how
you make all of the operations work that way, but it seems like
they've come up with a somewhat-consistent set of principles that are
applied across the board. Very different from the PG philosophy, of
course. And I'm not saying it's better. But it does eliminate the
problem of being unable to load data into the database, because in
such a model there's no such thing as invalidly-encoded data. Instead,
an encoding like UTF-8 is effectively extended so that every byte
sequence represents *something*. Whether that something is what you
wanted is another story.
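
To make that concrete, here's a minimal Python sketch of that model
(my reading of the rumors above, not anything based on the actual
product): scan the input, and wherever a byte can't begin a valid
UTF-8 sequence, emit it as a one-byte unit of its own.

def lenient_segments(data: bytes):
    """Split bytes into valid UTF-8 sequences and lone invalid bytes.

    A guess at the rumored model, not any vendor's implementation:
    every byte sequence decodes to *something*.
    """
    i = 0
    while i < len(data):
        # Valid UTF-8 sequences are 1-4 bytes; try each width in turn.
        for width in (1, 2, 3, 4):
            chunk = data[i:i + width]
            try:
                chunk.decode('utf-8')
                yield chunk              # a well-formed sequence
                i += width
                break
            except UnicodeDecodeError:
                continue
        else:
            yield data[i:i + 1]          # invalid byte: its own "character"
            i += 1

# A stray continuation byte up front is kept as a separate unit:
print(list(lenient_segments(b'\x80abc')))  # [b'\x80', b'a', b'b', b'c']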

At any rate, if we were to go in the direction of rejecting code
points that aren't yet assigned, or aren't yet known to the collation
library, that's another way for data loading to fail. That feels like
very defensible behavior, but it's not what everyone wants or is used to.

> At minimum I think we need to have some internal functions to check for
> unassigned code points. That belongs in core, because we generate the
> unicode tables from a specific version.

That's a good idea.
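
Something along these lines, I imagine. A rough Python equivalent of
the check, using the unicodedata module (the real thing would of
course consult the Unicode tables we generate in core):

import unicodedata

def contains_unassigned(s: str) -> bool:
    # General category "Cn" is what Unicode reports for code points
    # with no assigned character in the version the tables were built
    # from. (Note that "Cn" also covers noncharacters like U+FFFE.)
    return any(unicodedata.category(ch) == 'Cn' for ch in s)

print(contains_unassigned('hello'))   # False
print(contains_unassigned('\u0378'))  # True: U+0378 is unassigned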

> I also think we should expose some SQL functions to check for
> unassigned code points. That sounds useful, especially since we already
> expose normalization functions.

That's a good idea, too.

> One could easily imagine a domain with CHECK(NOT
> contains_unassigned(a)). Or an extension with a data type that uses the
> internal functions.

Yeah.

> Whether we ever get to a core data type -- and more importantly,
> whether anyone uses it -- I'm not sure.

Same here.

> Yeah, I am looking for a better compromise between:
>
> * everything is memcmp() and 'á' sometimes doesn't equal 'á'
> (depending on code point sequence)
> * everything is constantly changing, indexes break, and text
> comparisons are slow
>
> A stable idea of unicode normalization based on using only assigned
> code points is very tempting.

The fact that there are multiple types of normalization and multiple
notions of equality doesn't make this easier.
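
A quick Python illustration of both points, using the unicodedata
module (its copy of the Unicode tables standing in for whatever core
would use):

import unicodedata

composed   = '\u00E1'     # 'á' as one code point (U+00E1)
decomposed = 'a\u0301'    # 'á' as 'a' + combining acute accent (U+0301)

# memcmp-style equality: different code point sequences compare unequal.
print(composed == decomposed)                      # False

# Canonical equivalence: NFC (or NFD) maps both spellings to one form.
print(unicodedata.normalize('NFC', composed) ==
      unicodedata.normalize('NFC', decomposed))    # True

# Compatibility equivalence is a different notion again: NFKC also
# folds characters like the 'fi' ligature U+FB01, which canonical
# normalization leaves alone.
print(unicodedata.normalize('NFC',  '\uFB01'))     # 'ﬁ' (unchanged)
print(unicodedata.normalize('NFKC', '\uFB01'))     # 'fi'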

--
Robert Haas
EDB: http://www.enterprisedb.com
