Re: Pre-proposal: unicode normalized text

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Isaac Morland <isaac(dot)morland(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Chapman Flack <chap(at)anastigmatix(dot)net>, Nico Williams <nico(at)cryptonector(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-05 19:16:34
Message-ID: b870285789a03a7e6ef298ba3adaf9436b829c2e.camel@j-davis.com
Lists: pgsql-hackers

On Thu, 2023-10-05 at 09:10 -0400, Isaac Morland wrote:
> In the case you describe, the users don’t have text at all; they have
> bytes, and a vague belief about what encoding the bytes might be in
> and therefore what characters they are intended to represent. The
> correct way to store that in the database is using bytea.

I wouldn't be so absolute. It's text data to the user, and is
presumably working fine for them now, and if they switched to bytea
today then 'foo' would show up as '\x666f6f' in psql.
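To make the '\x666f6f' point concrete, here is a minimal sketch of how
the default "hex" bytea output format renders bytes: a leading \x
followed by two lowercase hex digits per byte. The function name
bytea_hex is made up for illustration; this is not PostgreSQL's actual
code.

```rust
/// Render bytes the way psql shows a bytea value in "hex" output format:
/// the prefix `\x` followed by two lowercase hex digits per byte.
/// `bytea_hex` is a hypothetical helper, not a PostgreSQL API.
fn bytea_hex(bytes: &[u8]) -> String {
    let mut out = String::from("\\x");
    for b in bytes {
        out.push_str(&format!("{:02x}", b));
    }
    out
}

fn main() {
    // The text 'foo' is the bytes 0x66 0x6f 0x6f.
    println!("{}", bytea_hex("foo".as_bytes())); // prints \x666f6f
}
```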

The point is that this is a somewhat messy problem because there's so
much software out there that treats byte strings and textual data
interchangeably. Rust goes the extra mile to organize all of this, and
it ends up with:

* String -- always UTF-8, never NUL-terminated
* CString -- NUL-terminated byte sequence with no internal NULs
* OsString[3] -- needed to make a Path[4], which is needed to open a
file[5]
* Vec<u8> -- any byte sequence

and I suppose we could work towards offering better support for these
different types, the casts between them, and delivering them in a form
the client can understand. But I wouldn't describe it as a solved
problem with one "correct" solution.
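As a rough sketch of how the Rust types listed above overlap, the
following uses only the standard library to check which of them a given
byte sequence can be converted to. The helper names classify and
to_path are invented for this example.

```rust
use std::ffi::{CString, OsString};
use std::path::PathBuf;

/// For an arbitrary byte sequence (the Vec<u8> case), report whether it
/// would also be accepted as a String (valid UTF-8) and as a CString
/// (no NUL bytes in the input). Hypothetical helper for illustration.
fn classify(bytes: &[u8]) -> (bool, bool) {
    let valid_string = String::from_utf8(bytes.to_vec()).is_ok();
    let valid_cstring = CString::new(bytes.to_vec()).is_ok();
    (valid_string, valid_cstring)
}

/// A String can always become an OsString, which in turn makes a path.
fn to_path(s: String) -> PathBuf {
    PathBuf::from(OsString::from(s))
}

fn main() {
    // Plain ASCII is valid in every one of these domains.
    println!("{:?}", classify(b"foo")); // (true, true)
    // 0xff never appears in UTF-8, but it contains no NUL.
    println!("{:?}", classify(&[0x66, 0xff])); // (false, true)
    // NUL is a valid UTF-8 code point, but not allowed inside a CString.
    println!("{:?}", classify(b"a\0b")); // (true, false)
    println!("{:?}", to_path(String::from("foo")));
}
```

Note that none of the types is a strict subset of all the others, which
is part of why the casts between them need some thought.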

One takeaway from this discussion is that it would be useful to provide
more flexibility in how values are represented to the client in a more
general way. In addition to encoding, representational issues have come
up with binary formats, bytea, extra_float_digits, etc.

The collection of books by CJ Date & Hugh Darwen, et al. (sorry I don't
remember exactly which books), made the theoretical case for explicitly
distinguishing values from representations at the language level. We're
starting to see that representational issues can't be satisfied with a
few special cases and hacks -- it's worth thinking about a general
solution to that problem. There was also a lot of relevant discussion
about how to think about overlapping domains (e.g. ASCII is valid in
any of these text domains).

> Text types should be for when you know what characters you want to
> store. In this scenario, the implementation detail of what encoding
> the database uses internally to write the data on the disk doesn't
> matter, any more than it matters to a casual user how a table is
> stored on disk.

Perhaps the user and application do know, and there's some kind of
subtlety that we're missing, or some historical artefact that we're not
accounting for, and that somehow makes UTF-8 unsuitable. Surely there
are applications that treat certain byte sequences in non-standard
ways, and perhaps not all of those byte sequences can be reproduced by
transcoding from UTF-8 to the client_encoding. In any case, I would
want to understand in detail why a user thinks UTF-8 is not good enough
before I make too strong a statement here.

Even the terminal font that I use renders some "identical" Unicode
characters slightly differently depending on the code points from which
they are composed. I believe that's an intentional convenience to make
it more apparent why the "diff" command (or other byte-based tool) is
showing a difference between two textually identical strings, but it's
also a violation of the Unicode standard. (This is another reason why
normalization might not be for everyone, but I believe it's still good
in typical cases.)
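For a concrete instance of two "identical" strings that a byte-based
tool sees as different, here is a small sketch using only the Rust
standard library. It compares a precomposed e-acute (U+00E9) with the
letter e followed by a combining acute accent (U+0301). The constant
and function names are invented for the example; actually normalizing
the strings would need something outside std, which I haven't shown.

```rust
// Two canonically equivalent spellings of the same user-visible character.
const PRECOMPOSED: &str = "\u{00e9}"; // single code point U+00E9
const DECOMPOSED: &str = "e\u{0301}"; // 'e' followed by combining acute U+0301

/// Return the UTF-8 bytes of a string, as a byte-based tool would see them.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

fn main() {
    // String comparison in Rust is bytewise, so these are not equal.
    println!("equal: {}", PRECOMPOSED == DECOMPOSED); // equal: false
    println!("{:x?}", utf8_bytes(PRECOMPOSED)); // [c3, a9]
    println!("{:x?}", utf8_bytes(DECOMPOSED)); // [65, cc, 81]
    // One code point versus two.
    println!(
        "{} vs {}",
        PRECOMPOSED.chars().count(),
        DECOMPOSED.chars().count()
    ); // 1 vs 2
}
```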

Regards,
Jeff Davis
