Re: The "char" type versus non-ASCII characters

From: Chapman Flack <chap(at)anastigmatix(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: The "char" type versus non-ASCII characters
Date: 2021-12-03 21:39:14
Message-ID: 61AA8E82.2010001@anastigmatix.net
Lists: pgsql-hackers

On 12/03/21 14:12, Tom Lane wrote:
> This reminds me of something I've been intending to bring up, which
> is that the "char" type is not very encoding-safe. charout() for
> example just regurgitates the single byte as-is.
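
(For reference, charout() is essentially the following; paraphrased
from memory of src/backend/utils/adt/char.c, so treat it as a sketch
rather than the exact source:)

    /* paraphrase of charout(); needs postgres.h and fmgr.h */
    Datum
    charout(PG_FUNCTION_ARGS)
    {
        char    ch = PG_GETARG_CHAR(0);
        char   *result = (char *) palloc(2);

        result[0] = ch;         /* the byte goes out as-is */
        result[1] = '\0';
        PG_RETURN_CSTRING(result);
    }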

I wonder whether what to do about that lies downstream of some broader
thinking about encoding-related type properties.

ISTM we don't, at present, have a clear story for types that have an
encoding (or repertoire) property that isn't one of (inapplicable,
server_encoding).

And yet such things exist, and more such things could or should exist
(NCHAR, healthier versions of xml or json, ...). "char" is an existing
example, because its current behavior is exactly as if it declared
"I am one byte of SQL_ASCII regardless of server setting".

Which is no trouble at all when the server setting is also SQL_ASCII.
But what does it mean when the server setting and the inherent
repertoire property of a type can be different? The present answer
isn't pretty.

When can charout() be called? typoutput functions don't have any
'internal' parameters, so nothing stops user code from calling them;
I don't know how often that's done, and that's a complication.
The canonical place for it to be called is inside printtup(), when
the client driver has requested format 0 for that attribute.
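
For concreteness, the relevant dispatch in printtup() looks roughly
like this (paraphrased from memory of
src/backend/access/common/printtup.c, so not exact):

    if (thisState->format == 0)
    {
        /* Text output: typoutput yields a cstring in the server encoding */
        char   *outputstr = OutputFunctionCall(&thisState->finfo, attr);

        pq_sendcountedtext(buf, outputstr, strlen(outputstr), false);
    }
    else
    {
        /* Binary output: typsend builds the wire bytes itself */
        bytea  *outputbytes = SendFunctionCall(&thisState->finfo, attr);

        pq_sendint32(buf, VARSIZE(outputbytes) - VARHDRSZ);
        pq_sendbytes(buf, VARDATA(outputbytes),
                     VARSIZE(outputbytes) - VARHDRSZ);
    }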

Up to that point, we could have known it was a type with SQL_ASCII
wired in, but after charout() all we have is a cstring. printtup
treats that type as having the server encoding, and the value goes
through conversion from the server encoding to the client encoding
in pq_sendcountedtext.
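
That blind conversion lives in pq_sendcountedtext itself (again
paraphrased from memory, this time of src/backend/libpq/pqformat.c):

    void
    pq_sendcountedtext(StringInfo buf, const char *str, int slen,
                       bool countincludesself)
    {
        int     extra = countincludesself ? 4 : 0;
        char   *p;

        p = pg_server_to_client(str, slen);     /* encoding conversion */
        if (p != str)           /* actual conversion was done? */
            slen = strlen(p);
        pq_sendint32(buf, slen + extra);
        appendBinaryStringInfo(buf, p, slen);
        if (p != str)
            pfree(p);
    }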

Indeed, cstring behaves exactly as if it were a type with the server
encoding. Send a cstring with format 1 rather than format 0, and while
it is no longer subject to the encoding conversion done in
pq_sendcountedtext, it dutifully performs the same conversion itself,
in its own cstring_send. unknownsend is the same way.

But of course a "char" column in format 1 would never go through cstring;
char_send would be called, and just plop the byte in the buffer unchanged
(which is the same operation as an encoding conversion from SQL_ASCII
to anything).
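
The contrast shows up directly in the two typsend functions (both
paraphrased from memory): cstring_send converts, char_send does not.

    Datum
    cstring_send(PG_FUNCTION_ARGS)
    {
        char           *str = PG_GETARG_CSTRING(0);
        StringInfoData  buf;

        pq_begintypsend(&buf);
        pq_sendtext(&buf, str, strlen(str));  /* converts to client encoding */
        PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
    }

    Datum
    char_send(PG_FUNCTION_ARGS)
    {
        char            arg1 = PG_GETARG_CHAR(0);
        StringInfoData   buf;

        pq_begintypsend(&buf);
        pq_sendbyte(&buf, arg1);              /* raw byte, no conversion */
        PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
    }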

Ever since I figured out that you have to look at a type's send/recv
functions to find out whether it is encoding-dependent, I have had to
walk myself through those steps again every time I forget why that is.
Having a type's character-encoding details show up in its send/recv
functions, and not in its in/out functions, never stops being
counterintuitive to me. But for server-encoding-dependent types, that's
how it is: you don't see it in the typoutput function, because on the
format-0 path the transcoding happens downstream in pq_sendcountedtext,
while on the format-1 path the same transcoding happens under the
type's own control, in its typsend function.

That was the second thing that surprised me: we have what we call
a text and a binary path, but for an encoding-dependent type, neither
one is a path where transcoding doesn't happen!

The difference is, the format-0 transcoding is applied blindly,
in pq_sendcountedtext, with no surviving information about the data
type (which has become cstring by that point). In contrast, on the
format-1 path, the type's typsend is in control. In theory, that would
allow type-aware conversion; a smarter xml_send could use &#n; form
for characters that won't go in the client encoding, while the blind
pq transcoding on format 0 would just botch the data.
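
Purely as a sketch of that idea, and nothing like what xml_send
actually does today, the escaping loop might look like this. It
assumes a UTF-8 server encoding and, for simplicity, escapes
everything non-ASCII rather than consulting the client encoding
(escaped references are plain ASCII, so they survive any client
encoding):

    /*
     * HYPOTHETICAL ONLY: emit ASCII as-is, and turn each non-ASCII
     * character into a &#n; reference.  Assumes UTF-8 server encoding;
     * pg_utf_mblen() and utf8_to_unicode() are declared in
     * mb/pg_wchar.h; StringInfo is from lib/stringinfo.h.
     */
    static void
    append_xml_escaped(StringInfo buf, const char *s, int len)
    {
        const unsigned char *p = (const unsigned char *) s;
        const unsigned char *end = p + len;

        while (p < end)
        {
            if (*p < 0x80)
                appendStringInfoChar(buf, (char) *p++);
            else
            {
                appendStringInfo(buf, "&#%u;",
                                 (unsigned int) utf8_to_unicode(p));
                p += pg_utf_mblen(p);
            }
        }
    }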

XML, in an ideal world, might live on disk in a form that cares nothing
for the server encoding, and be sent directly over the wire to a client
(it declares what encoding it's in) and presented to the application
over an XML-aware API that isn't hamstrung by the client's default
text encoding either.

But in the present world, we have somehow arrived at a setup where
there are only two paths such data can take, and each one is a funnel
that can be passed only by data that survives both the client and
the server encoding.

The FE/BE docs have said "Text has format code zero, binary has format
code one, and all other format codes are reserved for future definition"
ever since 7.4. Maybe the time will come for a format 2, where you say
"here's an encoding ID and some bytes"?

This rambled on a bit far afield from "what should charout do with
non-ASCII values?". But honestly, either nobody is storing non-ASCII
values in "char", and we could make any choice there and nothing would
break, or somebody is doing that and their stuff would be broken by any
choice of change.

So, is the current "char" situation so urgent that it demands some
one-off solution be chosen for it, or could it be neglected with minimal
risk until someday we've defined what "this datatype has encoding X that's
different from the server encoding" means, and that takes care of it?

Regards,
-Chap
