Re: Pre-proposal: unicode normalized text

From: Isaac Morland <isaac(dot)morland(at)gmail(dot)com>
To: Chapman Flack <chap(at)anastigmatix(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Nico Williams <nico(at)cryptonector(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-04 18:14:45
Message-ID: CAMsGm5eyegAkb2Eq+8d2j+BxDnSZXH0AUciwi6rA60V_ft=1dw@mail.gmail.com
Lists: pgsql-hackers

On Wed, 4 Oct 2023 at 14:05, Chapman Flack <chap(at)anastigmatix(dot)net> wrote:

> On 2023-10-04 13:47, Robert Haas wrote:
>

> The SQL standard would have me able to:
>
> CREATE TABLE foo (
> a CHARACTER VARYING CHARACTER SET UTF8,
> b CHARACTER VARYING CHARACTER SET LATIN1
> )
>
> and so on, and write character literals like
>
> _UTF8'Hello, world!' and _LATIN1'Hello, world!'
>
> and have those columns and data types independently contain what
> they can contain, without constraints imposed by one overall
> database encoding.
>
> Obviously, we're far from being able to do that. But should it
> become desirable to get closer, would it be worthwhile to also
> try to follow how the standard would have it look?
>
> Clearly, part of the job would involve making the wire protocol
> able to transmit binary values and identify their encodings.
>

I would go in the other direction (note: I’m ignoring all backward
compatibility considerations related to the current design of Postgres).

Always store only UTF-8 in the database, and send only UTF-8 on the wire
protocol. If we still want to have a concept of "client encoding", have the
client libpq take care of translating the bytes between the bytes used by
the caller and the bytes sent on the wire.
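
As a rough illustration of that client-side translation (a Python sketch, not libpq code; the function names to_wire and from_wire are made up for this example):

```python
def to_wire(caller_bytes: bytes, client_encoding: str) -> bytes:
    """Convert bytes in the caller's encoding to the UTF-8 bytes sent on the wire."""
    return caller_bytes.decode(client_encoding).encode("utf-8")


def from_wire(wire_bytes: bytes, client_encoding: str) -> bytes:
    """Convert UTF-8 bytes received from the server to the caller's encoding.

    Raises UnicodeEncodeError if the text contains a character that the
    caller's encoding cannot represent.
    """
    return wire_bytes.decode("utf-8").encode(client_encoding)
```

The server never sees anything but UTF-8 here; a character the caller's encoding cannot represent surfaces as an error in the client, not in the database.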

Note that you could still define columns as you say, but the character set
specification would effectively act simply as a CHECK constraint on the
characters allowed, essentially CHECK (column_name ~ '^[...all characters
in encoding...]*$'). We don't allow different on-disk representations of
dates or other data types, except when we really need to, and then we have
multiple data types (e.g. int vs. float) rather than different ways of
storing the same datatype.
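
To make the "character set as a constraint" idea concrete, here is a minimal sketch in Python rather than SQL. The helper name fits_charset is hypothetical, and Python's codec names stand in for SQL character set names:

```python
def fits_charset(value: str, charset: str) -> bool:
    """Return True if every character of value is representable in charset.

    The text itself is always held as Unicode; the charset only restricts
    which characters are allowed, like a CHECK constraint would.
    """
    try:
        value.encode(charset)
    except UnicodeEncodeError:
        return False
    return True
```

Under this view a LATIN1 column would accept 'Hello, world!' and reject text containing, say, CJK characters, while the stored bytes remain UTF-8 in both cases.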

What about characters not in UTF-8? If a character is important enough for
us to worry about in Postgres, it’s important enough to get a U+ number
from the Unicode Consortium, which automatically puts it in UTF-8. In the
modern context, "plain text" means "UTF-8 encoded text", as far as I'm
concerned.
