Re: The "char" type versus non-ASCII characters

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Chapman Flack <chap(at)anastigmatix(dot)net>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: The "char" type versus non-ASCII characters
Date: 2021-12-04 16:34:43
Message-ID: 2644723.1638635683@sss.pgh.pa.us
Lists: pgsql-hackers

Chapman Flack <chap(at)anastigmatix(dot)net> writes:
> On 12/03/21 14:12, Tom Lane wrote:
>> This reminds me of something I've been intending to bring up, which
>> is that the "char" type is not very encoding-safe. charout() for
>> example just regurgitates the single byte as-is.

> I wonder if maybe what to do about that lies downstream of some other
> thought about encoding-related type properties.

As you mentioned upthread, it's probably wrong to think of "char" as
character data at all. The catalogs use it as a poor man's enum type,
and it's just for convenience that we assign readable ASCII codes for
the enum values of a given column. The only reason to think of it as
encoding-dependent would be if you have ambitions to store a non-ASCII
character in a "char". But I think that's something we want to
strongly discourage, even if we don't prohibit it altogether. The
whole point of the type is to be one byte, so only in legacy encodings
can it possibly represent a non-ASCII character.
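
A small Python illustration of the one-byte constraint (not PostgreSQL code; the sample character and the helper name `fits_in_one_byte` are arbitrary choices made for this sketch): a non-ASCII character such as "é" occupies one byte in a legacy single-byte encoding like LATIN1, but two bytes in UTF-8, so it could not fit in a "char" there.

```python
# Illustration only: byte length of one non-ASCII character in two encodings.
ch = "é"  # arbitrary example character, U+00E9

latin1_bytes = ch.encode("latin-1")  # legacy single-byte encoding
utf8_bytes = ch.encode("utf-8")      # multibyte encoding


def fits_in_one_byte(s: str, encoding: str) -> bool:
    """Return True if s encodes to exactly one byte in the given encoding."""
    return len(s.encode(encoding)) == 1
```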

So I'm visualizing it as a uint8 that we happen to like to store
ASCII codes in, and that's what prompts the thought of using a
numeric representation for non-ASCII values. I think you're just
in for pain if you want to consider such values as character data
rather than numbers.
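
One way to picture the "uint8 that happens to hold ASCII codes" idea is the Python sketch below. It is only an illustration: the message does not specify what the numeric representation would look like, so the backslash-octal format and the function names `char_out` and `char_in` are assumptions made for this example, not the actual charout()/charin() code.

```python
def char_out(value: int) -> str:
    """Render a one-byte value: ASCII as-is, anything else numerically."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    if value < 0x80:
        return chr(value)          # plain ASCII code, shown as the character
    return "\\%03o" % value        # assumed format: backslash + 3 octal digits


def char_in(text: str) -> int:
    """Inverse of char_out for the assumed text format."""
    if len(text) == 1 and ord(text) < 0x80:
        return ord(text)
    if len(text) == 4 and text[0] == "\\" and all(c in "01234567" for c in text[1:]):
        value = int(text[1:], 8)
        if 0x80 <= value <= 0xFF:
            return value
    raise ValueError("not a valid one-byte representation: %r" % text)
```

With this scheme the output is pure ASCII whatever the byte holds, so it is valid text in any server encoding, and non-ASCII values round-trip as numbers rather than as characters.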

> ... "char" is an existing
> example, because its current behavior is exactly as if it declared
> "I am one byte of SQL_ASCII regardless of server setting".

But it's not quite that. If we treated it as SQL_ASCII, we'd refuse
to convert it to some other encoding unless the value passes encoding
verification, which is exactly what charout() is not doing.
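
As a loose analogy in Python (the codecs here merely stand in for the server's encoding verification, and `passes_verification` is a hypothetical helper, not a PostgreSQL function): a verifying conversion rejects a lone high-bit byte that is not valid in the target encoding, whereas passing the byte through unchecked skips that step.

```python
def passes_verification(raw: bytes, encoding: str) -> bool:
    """Return True if raw is valid text in the given encoding."""
    try:
        raw.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


ascii_byte = b"r"      # an ASCII code, valid in any ASCII-compatible encoding
high_byte = b"\xe9"    # a lone high-bit byte, not a complete UTF-8 sequence
```

Here `passes_verification(high_byte, "utf-8")` is false, so a conversion that verified its input would refuse the value instead of emitting the byte as-is.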

> Indeed, cstring behaves completely as if it is a type with the server
> encoding.

Yup, cstring is definitely presumed to be in the server's encoding.

> So, is the current "char" situation so urgent that it demands some
> one-off solution be chosen for it, or could it be neglected with minimal
> risk until someday we've defined what "this datatype has encoding X that's
> different from the server encoding" means, and that takes care of it?

I'm not willing to leave it broken in the rather faint hope that
someday there will be a more general solution, especially since
I don't buy the premise that "char" ought to participate in any
such solution.

regards, tom lane
