Re: Unicode escapes with any backend encoding

From: Chapman Flack <chap(at)anastigmatix(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Unicode escapes with any backend encoding
Date: 2020-01-14 22:03:33
Message-ID: ef2648e8-66dc-c95c-c5ad-72ff05191c2c@anastigmatix.net
Lists: pgsql-hackers

On 1/14/20 4:25 PM, Tom Lane wrote:
> Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> writes:
>> On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack <chap(at)anastigmatix(dot)net> wrote:
>>> On 1/14/20 10:10 AM, Tom Lane wrote:
>>>> to me that this error is just useless pedantry. As long as the DB
>>>> encoding can represent the desired character, it should be transparent
>>>> to users.
>
>>> That's my position too.
>
>> and mine.
>
> I'm confused --- yesterday you seemed to be against this idea.
> Have you changed your mind?
>
> I'll gladly go change the patch if people are on board with this.

Hmm, well, let me clarify for my own part what I think I'm agreeing
with ... perhaps it's misaligned with something further upthread.

In an ideal world (which may be ideal in more ways than are in scope
for the present discussion) I would expect to see these principles:

1. On input, whether a Unicode escape is or isn't allowed should
not depend on any encoding settings. It should be lexically
allowed always, and if it represents a character that exists
in the server encoding, it should mean that character. If the
character is not representable in the server encoding, it
should produce an error that says that.

2. If it happens that the character is representable in both the
storage encoding and the client encoding, it shouldn't matter
whether it arrives literally as an é or as an escape. Either
should get stored on disk as the same bytes.

3. On output, as long as the character is representable in the client
encoding, there is nothing to worry about. It will be sent as its
representation in the client encoding (which may be different bytes
than its representation in the server encoding).

4. If a character to be output isn't in the client encoding, it
will be datatype-dependent whether there is any way to escape it.
For example, xml_out could produce &#x????; forms, and json_out
could produce \u???? forms.

5. If the datatype being output has no escaping rules available
(as would be the case for an ordinary text column, say), then
the unrepresentable character has to be reported in an error.
(Encoding conversions often have the option of substituting
a replacement character like ? but I don't believe a DBMS has
any business making such changes to data, unless by explicit
opt-in. If it can't give you the data you wanted, it should
say "here's why I can't give you that.")

6. While 'text' in general provides no escaping mechanism, some
functions that produce text may still have that option. For
example, quote_literal and quote_ident could conceivably
produce the U&'...' or U&"..." forms, respectively, if
the argument contains characters that won't go in the client
encoding.
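
To make principles 1 through 5 a bit more concrete, here is a rough
Python sketch. It is not PostgreSQL code: the function names (store,
send, xml_escape, json_escape) are hypothetical, and Python's codecs
merely stand in for the server and client encodings.

```python
def from_escape(code_point: int) -> str:
    """A character that arrived as a Unicode escape (principle 1)."""
    return chr(code_point)


def from_client(raw: bytes, client_encoding: str) -> str:
    """A character that arrived literally, in the client encoding."""
    return raw.decode(client_encoding)


def store(text: str, server_encoding: str) -> bytes:
    """Principles 1 and 2: the only input error is 'cannot be stored'."""
    try:
        return text.encode(server_encoding)
    except UnicodeEncodeError as e:
        bad = text[e.start]
        raise ValueError(
            f"character U+{ord(bad):04X} is not representable "
            f"in server encoding {server_encoding}"
        ) from None


def xml_escape(ch: str) -> str:
    """Hypothetical escape rule for an XML-like datatype."""
    return f"&#x{ord(ch):X};"


def json_escape(ch: str) -> str:
    """Hypothetical escape rule for a JSON-like datatype."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return f"\\u{cp:04x}"
    cp -= 0x10000  # outside the BMP: emit a UTF-16 surrogate pair
    return f"\\u{0xD800 + (cp >> 10):04x}\\u{0xDC00 + (cp & 0x3FF):04x}"


def send(text: str, client_encoding: str, escape=None) -> bytes:
    """Principles 3 to 5: convert, escape if the datatype can, else error."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(client_encoding)
        except UnicodeEncodeError:
            if escape is None:  # e.g. plain text: no substitution, just say why
                raise ValueError(
                    f"character U+{ord(ch):04X} is not representable "
                    f"in client encoding {client_encoding}"
                ) from None
            # Assumes the escape itself is ASCII and so fits the client encoding.
            out += escape(ch).encode(client_encoding)
    return bytes(out)
```

With these, store(from_escape(0xE9), "latin-1") and
store(from_client(b"\xc3\xa9", "utf-8"), "latin-1") give the same
bytes, while send("€", "latin-1") with no escape rule raises an error
rather than substituting a replacement character.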

I understand that on the way from 1 to 6 I will have drifted
further from what's discussed in this thread; for example, I bet
that quote_literal/quote_ident never produce U& forms now, and
that no one is proposing to change that, and I'm pretending not
to notice the question of how astonishing such behavior could be.
(Not to mention, how would they know whether they are returning
a value that's destined to be sent to the client, and so converted
to the client encoding, rather than to be used in a purely
server-side expression? Maybe distinct versions of those functions
could take an encoding argument, and produce the U& forms when the
content won't go in the specified encoding. That would avoid
astonishing changes to existing functions.)
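
A sketch of what such an encoding-aware variant might look like, again
in Python purely for illustration. The name quote_literal_for is
hypothetical (no such function exists in PostgreSQL), and the plain
case assumes standard_conforming_strings is on, so it only doubles
single quotes.

```python
def quote_literal_for(text: str, encoding: str) -> str:
    """Quote text as an SQL literal, using U&'...' only when needed."""

    def fits(ch: str) -> bool:
        try:
            ch.encode(encoding)
            return True
        except UnicodeEncodeError:
            return False

    # Everything fits the target encoding: an ordinary literal will do.
    if all(fits(ch) for ch in text):
        return "'" + text.replace("'", "''") + "'"

    parts = []
    for ch in text:
        cp = ord(ch)
        if ch == "'":
            parts.append("''")
        elif ch == "\\":
            parts.append("\\\\")  # backslash is the escape character in U&''
        elif fits(ch):
            parts.append(ch)
        elif cp <= 0xFFFF:
            parts.append(f"\\{cp:04X}")   # four-hex-digit form
        else:
            parts.append(f"\\+{cp:06X}")  # six-hex-digit form
    return "U&'" + "".join(parts) + "'"
```

For instance, quote_literal_for("it's", "latin-1") stays an ordinary
literal, while a string containing a euro sign comes back in the U&
form with \20AC in place of the character.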

Regards,
-Chap
