The "char" type versus non-ASCII characters

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Chapman Flack <chap(at)anastigmatix(dot)net>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: The "char" type versus non-ASCII characters
Date: 2021-12-03 19:12:10
Message-ID: 2318797.1638558730@sss.pgh.pa.us
Lists: pgsql-hackers

[ breaking off a different new thread ]

Chapman Flack <chap(at)anastigmatix(dot)net> writes:
> Then there's "char". It's category S, but does not apply the server
> encoding. You could call it an 8-bit int type, but it's typically used
> as a character, making it well-defined for ASCII values and not so
> for others, just like SQL_ASCII encoding. You could as well say that
> the "char" type has a defined encoding of SQL_ASCII at all times,
> regardless of the database encoding.

This reminds me of something I've been intending to bring up, which
is that the "char" type is not very encoding-safe. charout() for
example just regurgitates the single byte as-is. I think we deemed
that okay the last time anyone thought about it, but that was when
single-byte encodings were the mainstream usage for non-ASCII data.
If you're using UTF8 or another multi-byte server encoding, it's
quite easy to get an invalidly-encoded string this way, which at
minimum is going to break dump/restore scenarios.
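To make the failure mode concrete, here is a minimal sketch in C, not PostgreSQL's actual code. The names sketch_charout_raw and sketch_is_valid_utf8 are hypothetical; the first only imitates the "regurgitate the byte as-is" behavior described above, and the second is a simplified structural check (it does not reject overlong forms or surrogates). Under those assumptions, a lone high-bit byte such as 0xE9 is not a well-formed UTF-8 string:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the behavior described above: the output
 * string is just the stored byte followed by a terminator.
 */
static void
sketch_charout_raw(unsigned char ch, char out[2])
{
    out[0] = (char) ch;
    out[1] = '\0';
}

/*
 * Simplified structural UTF-8 check (assumption: good enough for
 * illustration; overlong encodings and surrogates are not rejected).
 */
static bool
sketch_is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t      i = 0;

    while (i < len)
    {
        unsigned char c = buf[i];
        size_t      need;       /* continuation bytes required */

        if (c < 0x80)
            need = 0;
        else if ((c & 0xE0) == 0xC0)
            need = 1;
        else if ((c & 0xF0) == 0xE0)
            need = 2;
        else if ((c & 0xF8) == 0xF0)
            need = 3;
        else
            return false;       /* stray continuation or invalid lead byte */

        if (need > len - i - 1)
            return false;       /* sequence truncated */

        for (size_t k = 1; k <= need; k++)
        {
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
        }
        i += need + 1;
    }
    return true;
}
```

With these helpers, the output for 'a' passes the check, while the one-byte output for 0xE9 fails it because the lead byte promises two continuation bytes that never arrive.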

I can think of at least three ways we might address this:

* Forbid all non-ASCII values for type "char". This results in
simple and portable semantics, but it might break usages that
work okay today.

* Allow such values only in single-byte server encodings. This
is a bit messy, but it wouldn't break any cases that are not
problematic already.

* Continue to allow non-ASCII values, but change charin/charout,
char_text, etc., so that the external representation is encoding-safe
(perhaps make it an octal or decimal number).

Either of the first two ways would have to contemplate what to do
with disallowed values that snuck into the DB via pg_upgrade.
That leads me to think that the third way might be the most
preferable, even though it's not terribly backward-compatible.
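
As an illustration of what the third option could look like, here is a rough sketch in C. It is only one possible design under stated assumptions, not a proposal for the actual charin/charout code: ASCII bytes are printed as themselves, and bytes with the high bit set are printed as a backslash followed by three octal digits. The names sketch_char_out and sketch_char_in are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical encoding-safe output: the result contains only ASCII
 * bytes, so it is valid in any ASCII-compatible server encoding.
 * A zero byte yields the empty string in this sketch.
 */
static void
sketch_char_out(unsigned char ch, char *out, size_t outlen)
{
    if (ch < 0x80)
        snprintf(out, outlen, "%c", ch);
    else
        snprintf(out, outlen, "\\%03o", (unsigned int) ch);
}

/*
 * Hypothetical matching input: accept exactly "\ooo" (backslash plus
 * three octal digits, value at most 0377) as an escaped byte, and
 * otherwise take the first byte of the input as-is.
 */
static unsigned char
sketch_char_in(const char *in)
{
    if (strlen(in) == 4 && in[0] == '\\' &&
        in[1] >= '0' && in[1] <= '3' &&
        in[2] >= '0' && in[2] <= '7' &&
        in[3] >= '0' && in[3] <= '7')
        return (unsigned char) (((in[1] - '0') << 6) |
                                ((in[2] - '0') << 3) |
                                (in[3] - '0'));

    return (unsigned char) in[0];
}
```

In this sketch the byte 0xE9 is written as the four ASCII characters \351 and reads back as 0xE9, so a dump would restore cleanly regardless of the server encoding. The cost is the backward-compatibility break noted above: clients that expected the raw byte would see the escape instead.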

There's a nearby issue that we might do something about at the
same time, which is that chartoi4() and i4tochar() think that
the byte value of a "char" is signed, while all the other
operations treat it as unsigned. I wouldn't be too surprised if
this behavior is the direct cause of the bug fixed in a6bd28beb.
The issue vanishes if we forbid non-ASCII values, but otherwise
I'd be inclined to change these functions to treat the byte
values as unsigned.
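
The signedness mismatch can be sketched in C as follows. These functions are hypothetical stand-ins, not the real chartoi4() or comparison code; they only contrast a signed interpretation of the byte with an unsigned one, assuming the usual two's-complement behavior for the narrowing casts.

```c
#include <stdint.h>

/* Signed reading of the byte, as the message says chartoi4() does. */
static int32_t
sketch_char_to_int4_signed(char ch)
{
    return (int32_t) (signed char) ch;
}

/* Unsigned reading, the change suggested in the message. */
static int32_t
sketch_char_to_int4_unsigned(char ch)
{
    return (int32_t) (unsigned char) ch;
}

/* Unsigned ordering, standing in for "all the other operations". */
static int
sketch_char_less_than(char a, char b)
{
    return (unsigned char) a < (unsigned char) b;
}
```

For the byte 0xE9, the signed conversion gives -23 while the unsigned one gives 233. An unsigned comparison sorts 0xE9 after 'a', but comparing the signed integer results sorts it before, which is the kind of inconsistency the paragraph above describes.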

Thoughts?

regards, tom lane
