Re: Unicode escapes with any backend encoding

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net>
Subject: Re: Unicode escapes with any backend encoding
Date: 2020-01-14 15:10:36
Message-ID: 7317.1579014636@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> writes:
>> On Tue, Jan 14, 2020 at 10:02 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Grepping for other direct uses of unicode_to_utf8(), I notice that
>>> there are a couple of places in the JSON code where we have a similar
>>> restriction that you can only write a Unicode escape in UTF8 server
>>> encoding. I'm not sure whether these same semantics could be
>>> applied there, so I didn't touch that.

>> Off the cuff I'd be inclined to say we should keep the text escape
>> rules the same. We've already extended the JSON standard by allowing
>> non-UTF8 encodings.

> Right. I'm just thinking though that if you can write "é" literally
> in a JSON string, even though you're using LATIN1 not UTF8, then why
> not allow writing that as "\u00E9" instead? The latter is arguably
> truer to spec.
> However, if JSONB collapses "\u00E9" to LATIN1 "é", that would be bad,
> unless we have a way to undo it on printout. So there might be
> some more moving parts here than I thought.

On third thought, what would be so bad about that? Let's suppose
I write:

INSERT ... values('{"x": "\u00E9"}'::jsonb);

and the jsonb parsing logic chooses to collapse the backslash escape to
the represented character, i.e., "é". Why should it matter whether
the database encoding is UTF8 or LATIN1? If I am using UTF8
client encoding, I will see the "é" in UTF8 encoding either way,
because of output encoding conversion. If I am using LATIN1
client encoding, I will see the "é" in LATIN1 either way --- or
at least, I will if the database encoding is UTF8. Right now I get
an error for that when the database encoding is LATIN1 ... but if
I store the "é" as literal "é", it works, either way. So it seems
to me that this error is just useless pedantry. As long as the DB
encoding can represent the desired character, it should be transparent
to users.
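
To make the argument above concrete, here is a rough sketch in Python. It is
not PostgreSQL code: store() and fetch() are hypothetical stand-ins for
"jsonb input collapses the escape and stores the character in the database
encoding" and "output conversion to the client encoding". Python's "latin-1"
codec is assumed to correspond to the LATIN1 server encoding.

```python
import json

def store(json_text, db_encoding):
    # Hypothetical model of jsonb input: resolve the \uXXXX escape to the
    # character it represents, then hold it in the database encoding.
    # Raises UnicodeEncodeError if the encoding cannot represent it.
    value = json.loads(json_text)["x"]
    return value.encode(db_encoding)

def fetch(stored, db_encoding, client_encoding):
    # Hypothetical model of output encoding conversion.
    return stored.decode(db_encoding).encode(client_encoding)

escaped = '{"x": "\\u00E9"}'   # the escape form
literal = '{"x": "é"}'         # the literal form

for db in ("utf-8", "latin-1"):
    # Escape and literal are indistinguishable once stored.
    assert store(escaped, db) == store(literal, db)

for client in ("utf-8", "latin-1"):
    # The client sees the same bytes whichever database encoding is used.
    via_utf8 = fetch(store(escaped, "utf-8"), "utf-8", client)
    via_latin1 = fetch(store(escaped, "latin-1"), "latin-1", client)
    assert via_utf8 == via_latin1

def representable(json_text, db_encoding):
    # The only case that genuinely needs an error: the database encoding
    # has no code point for the character.
    try:
        store(json_text, db_encoding)
        return True
    except UnicodeEncodeError:
        return False
```

In this model, representable('{"x": "\\u20AC"}', "latin-1") is False because
the euro sign has no LATIN1 code point, while the same input is fine for
"utf-8". That is the one case where rejecting the escape is justified.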

regards, tom lane
