Re: The "char" type versus non-ASCII characters

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Chapman Flack <chap(at)anastigmatix(dot)net>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: The "char" type versus non-ASCII characters
Date: 2021-12-03 19:35:03
Message-ID: c44b31d4-044a-0e45-1a98-995517b47df7@dunslane.net
Lists: pgsql-hackers


On 12/3/21 14:12, Tom Lane wrote:
> [ breaking off a different new thread ]
>
> Chapman Flack <chap(at)anastigmatix(dot)net> writes:
>> Then there's "char". It's category S, but does not apply the server
>> encoding. You could call it an 8-bit int type, but it's typically used
>> as a character, making it well-defined for ASCII values and not so
>> for others, just like SQL_ASCII encoding. You could as well say that
>> the "char" type has a defined encoding of SQL_ASCII at all times,
>> regardless of the database encoding.
> This reminds me of something I've been intending to bring up, which
> is that the "char" type is not very encoding-safe. charout() for
> example just regurgitates the single byte as-is. I think we deemed
> that okay the last time anyone thought about it, but that was when
> single-byte encodings were the mainstream usage for non-ASCII data.
> If you're using UTF8 or another multi-byte server encoding, it's
> quite easy to get an invalidly-encoded string this way, which at
> minimum is going to break dump/restore scenarios.
>
> I can think of at least three ways we might address this:
>
> * Forbid all non-ASCII values for type "char". This results in
> simple and portable semantics, but it might break usages that
> work okay today.
>
> * Allow such values only in single-byte server encodings. This
> is a bit messy, but it wouldn't break any cases that are not
> problematic already.
>
> * Continue to allow non-ASCII values, but change charin/charout,
> char_text, etc so that the external representation is encoding-safe
> (perhaps make it an octal or decimal number).
>
> Either of the first two ways would have to contemplate what to do
> with disallowed values that snuck into the DB via pg_upgrade.
> That leads me to think that the third way might be the most
> preferable, even though it's not terribly backward-compatible.
>

I don't like #2. Is #3 going to change the external representation only
for non-ASCII values? If so, that seems OK. Changing it for ASCII
values seems ugly. #1 is the simplest to implement and to understand,
and I suspect it would break very little in practice, but others might
disagree with that assessment.
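
For illustration only, here is a minimal sketch of what the "non-ASCII only"
reading of #3 could look like. It is not PostgreSQL's charout(); the function
name char_out_sketch, the buffer contract, and the backslash-plus-three-octal-
digits format are all assumptions chosen to match the "octal number" idea
quoted above. ASCII bytes are emitted unchanged, so only high-bit bytes get a
new external representation.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a PostgreSQL function): write the external
 * form of one "char" byte into buf as a NUL-terminated string.
 *
 * Returns the number of characters written, not counting the NUL,
 * or 0 if the buffer is too small (fewer than 5 bytes) or c is zero.
 */
static size_t
char_out_sketch(unsigned char c, char *buf, size_t buflen)
{
    /* Worst case is a backslash, three octal digits, and the NUL. */
    if (buflen < 5)
        return 0;

    if (c < 0x80)
    {
        /* ASCII byte: emit it as-is; a zero byte yields an empty string. */
        buf[0] = (char) c;
        buf[1] = '\0';
        return (c != 0) ? 1 : 0;
    }

    /* High-bit byte: emit an ASCII-only escape such as \351 for 0xE9. */
    snprintf(buf, buflen, "\\%03o", (unsigned int) c);
    return 4;
}
```

With this sketch, 'a' still comes out as "a", while the byte 0xE9 comes out
as the four ASCII characters \351 rather than a lone byte that is not valid
UTF8. The input side would need a matching rule for parsing the escape,
which is not shown here.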

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
