Re: Pre-proposal: unicode normalized text

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Isaac Morland <isaac(dot)morland(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Chapman Flack <chap(at)anastigmatix(dot)net>, Nico Williams <nico(at)cryptonector(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-05 11:31:54
Message-ID: CA+TgmobAxizsgjxvZdEQxjEs6RA3qu7JLti_LdXtaXODJoWzNw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 4, 2023 at 9:02 PM Isaac Morland <isaac(dot)morland(at)gmail(dot)com> wrote:
>> > What about characters not in UTF-8?
>>
>> Honestly I'm not clear on this topic. Are the "private use" areas in
>> unicode enough to cover use cases for characters not recognized by
>> unicode? Which encodings in postgres can represent characters that
>> can't be automatically transcoded (without failure) to unicode?
>
> Here I’m just anticipating a hypothetical objection, “what about characters that can’t be represented in UTF-8?” to my suggestion to always use UTF-8 and I’m saying we shouldn’t care about them. I believe the answers to your questions in this paragraph are “yes”, and “none”.

Years ago, I remember SJIS being cited as an example of an encoding
that had characters which weren't part of Unicode. I don't know
whether this is still a live issue.

But I do think that sometimes users are reluctant to perform encoding
conversions on the data that they have. Sometimes they're not
completely certain what encoding their data is in, and sometimes
they're worried that the encoding conversion might fail or produce
wrong answers. In theory, if your existing data is validly encoded and
you know what encoding it's in and it's easily mapped onto UTF-8,
there's no problem. You can just transcode it and be done. But in
practice, things are often a lot messier than that.
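
To make those failure modes concrete, here is a minimal Python sketch.
It is not PostgreSQL code, and the helper name transcode_to_utf8 is
made up for this illustration. It shows one conversion that works, one
that fails outright because the source encoding was guessed wrong, and
one that succeeds but silently produces the wrong characters.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Hypothetical helper: decode raw bytes as source_encoding, re-encode as UTF-8.

    Raises UnicodeDecodeError if raw is not valid in source_encoding.
    """
    return raw.decode(source_encoding, errors="strict").encode("utf-8")


# "é" stored in LATIN1 is the single byte 0xE9.
latin1_bytes = "é".encode("latin-1")

# Case 1: the encoding is known and correct, so transcoding just works.
converted = transcode_to_utf8(latin1_bytes, "latin-1")

# Case 2: the data is wrongly assumed to be UTF-8 already. A lone 0xE9
# is not a valid UTF-8 sequence, so the conversion fails loudly.
try:
    transcode_to_utf8(latin1_bytes, "utf-8")
    conversion_failed = False
except UnicodeDecodeError:
    conversion_failed = True

# Case 3: the data is really UTF-8 but is wrongly assumed to be LATIN1.
# Every byte is valid LATIN1, so nothing fails, but the result is mojibake.
utf8_bytes = "é".encode("utf-8")
garbled = transcode_to_utf8(utf8_bytes, "latin-1").decode("utf-8")

print(converted, conversion_failed, repr(garbled))
```

The third case is the worrying one: the conversion reports no error,
so the damage is only noticed later, if at all.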

Which gives me some sympathy with the idea of wanting multiple
character sets within a database. Such a feature exists in some other
database systems and is, presumably, useful to some people. On the
other hand, to do that in PostgreSQL, we'd need to propagate the
character set/encoding information into all of the places that
currently get the typmod and collation, and that is not a small number
of places. It's a lot of infrastructure for the project to carry
around for a feature that's probably only going to continue to become
less relevant.

I suppose you never know, though. Maybe the Unicode consortium will
explode in a tornado of fiery rage and there will be dueling standards
making war over the proper way of representing an emoji of a dog
eating broccoli for decades to come. In that case, our hypothetical
multi-character-set feature might seem prescient.

--
Robert Haas
EDB: http://www.enterprisedb.com
