Re: Pre-proposal: unicode normalized text

From: Isaac Morland <isaac(dot)morland(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Chapman Flack <chap(at)anastigmatix(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Nico Williams <nico(at)cryptonector(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-05 01:02:21
Message-ID: CAMsGm5f8k4_C6VuerSbF2gXeVwD9kMZQk43kO85Oi4oGVL7EMA@mail.gmail.com
Lists: pgsql-hackers

On Wed, 4 Oct 2023 at 17:37, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> On Wed, 2023-10-04 at 14:14 -0400, Isaac Morland wrote:
> > Always store only UTF-8 in the database
>
> What problem does that solve? I don't see our encoding support as a big
> source of problems, given that database-wide UTF-8 already works fine.
> In fact, some postgres features only work with UTF-8.
>

My idea is in the context of a suggestion that we support specifying the
encoding per column. I don't mean to suggest eliminating the ability to set
a server-wide encoding, although I doubt there is any use case for using
anything other than UTF-8 except for an old database that hasn’t been
converted yet.

I see no reason to write different strings using different encodings in the
data files, depending on what column they belong to. The various text types
are all abstract data types which store sequences of characters (not
bytes); if one wants bytes, then one has to encode them. Of course, if one
wants UTF-8 bytes, then the encoding is, under the covers, the identity
function, but conceptually it is still taking the characters stored in the
database and converting them to bytes according to a specific encoding.
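To make the characters-versus-bytes distinction concrete, here is a minimal
sketch in Python rather than anything PostgreSQL-specific. It assumes that a
Python str is a fair stand-in for an abstract text value, and the variable
names are purely illustrative.

```python
# Illustration only: a Python str stands in for an abstract text value,
# i.e. a sequence of characters with no byte representation implied.
text = "na\u00efve caf\u00e9"  # "naïve café", written with escapes

# Bytes only exist once an encoding is chosen.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Same ten characters, two different byte sequences.
assert len(text) == 10
assert len(utf8_bytes) == 12    # the two accented letters take 2 bytes each
assert len(latin1_bytes) == 10  # one byte per character
assert utf8_bytes != latin1_bytes

# Decoding with the matching encoding recovers the same characters.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text
```

Whichever encoding happens to be used on disk, the value the user sees is
the character sequence; the bytes are a product of the conversion.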

By contrast, although I don’t see it as a top-priority use case, I can
imagine somebody wanting to restrict the characters stored in a particular
column to characters that can be encoded in a particular encoding. That is
what "CHARACTER SET LATIN1" and so on should mean.
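
As a rough sketch of what such a restriction would check, again in Python and
not as a description of any existing PostgreSQL behaviour: the function name
fits_character_set is hypothetical, and Python's "latin-1" codec is assumed
to be a reasonable stand-in for LATIN1.

```python
def fits_character_set(text: str, encoding: str) -> bool:
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        # At least one character has no representation in this encoding.
        return False
    return True


# Accepted: every character exists in Latin-1.
assert fits_character_set("caf\u00e9", "latin-1")
# Rejected: the euro sign (U+20AC) is not part of ISO 8859-1.
assert not fits_character_set("\u20ac", "latin-1")
# Any str made of valid characters can be encoded as UTF-8.
assert fits_character_set("\u65e5\u672c\u8a9e", "utf-8")
```

The point is that the check is about which characters are allowed in the
column, not about how the accepted values are stored.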

> > What about characters not in UTF-8?
>
> Honestly I'm not clear on this topic. Are the "private use" areas in
> unicode enough to cover use cases for characters not recognized by
> unicode? Which encodings in postgres can represent characters that
> can't be automatically transcoded (without failure) to unicode?
>

Here I’m just anticipating a hypothetical objection to my suggestion to
always use UTF-8, namely “what about characters that can’t be represented in
UTF-8?”, and I’m saying we shouldn’t care about them. I believe the answers
to your two questions in this paragraph are “yes” and “none”, respectively.
