Re: Unicode escapes with any backend encoding

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: Chapman Flack <chap(at)anastigmatix(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Unicode escapes with any backend encoding
Date: 2020-01-15 22:34:09
Message-ID: 2863.1579127649@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> writes:
>> Perhaps I expressed myself badly. What I meant was that we should keep
>> the json and text escape rules in sync, as they are now. Since we're
>> changing the text rules to allow resolvable non-ascii unicode escapes
>> in non-utf8 locales, we should do the same for json.

> Got it. I'll make the patch do that in a little bit.

OK, here's v2, which brings JSONB into the fold and also makes some
effort to produce an accurate error cursor for invalid Unicode escapes.
As it's set up, we only pay the extra cost of setting up an error
context callback when we're actually processing a Unicode escape,
so I think that's an acceptable cost. (It's not much of a cost,
anyway.)

The callback support added here is pretty much a straight copy-and-paste
of the existing functions setup_parser_errposition_callback() and friends.
That's slightly annoying --- we could perhaps merge those into one.
But I didn't see a good common header to put such a thing into, so
I just did it like this.
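
For readers unfamiliar with the pattern, here is a minimal standalone
sketch of an error-position callback that is pushed only around the
processing of one escape and popped again afterwards. Every name in it
(ErrContext, EscapeErrState, setup_escape_err_callback() and so on) is
hypothetical and chosen for illustration; it is not the code in the patch
and not the signature of setup_parser_errposition_callback(). It also
assumes errors are reported by an ordinary function call rather than a
non-local exit.

```c
#include <stddef.h>

/* One entry in a linked stack of error-context callbacks (hypothetical). */
typedef struct ErrContext
{
    struct ErrContext *previous;
    void        (*callback) (void *arg);
    void       *arg;
} ErrContext;

static ErrContext *err_context_stack = NULL;

/* State shared between the scanner and its callback (hypothetical). */
typedef struct EscapeErrState
{
    ErrContext  cb;
    int         location;   /* byte offset of the escape being processed */
    int         reported;   /* offset the callback reported, or -1 if none */
} EscapeErrState;

/* Invoked while an error is being reported: record the escape's position. */
static void
escape_err_callback(void *arg)
{
    EscapeErrState *st = (EscapeErrState *) arg;

    st->reported = st->location;
}

/* Push the callback; this cost is paid only when an escape is processed. */
static void
setup_escape_err_callback(EscapeErrState *st, int location)
{
    st->location = location;
    st->reported = -1;
    st->cb.callback = escape_err_callback;
    st->cb.arg = st;
    st->cb.previous = err_context_stack;
    err_context_stack = &st->cb;
}

/* Pop the callback once the escape has been handled. */
static void
cancel_escape_err_callback(EscapeErrState *st)
{
    err_context_stack = st->cb.previous;
}

/* Stand-in for error reporting: run every callback on the stack. */
static void
raise_error(void)
{
    for (ErrContext *c = err_context_stack; c != NULL; c = c->previous)
        c->callback(c->arg);
}

/* Accept a code point found at "location"; returns 1 if valid, 0 if not. */
static int
check_codepoint(EscapeErrState *st, unsigned long cp, int location)
{
    int         ok = 1;

    setup_escape_err_callback(st, location);
    if (cp == 0 || cp > 0x10FFFF)
    {
        raise_error();
        ok = 0;
    }
    cancel_escape_err_callback(st);
    return ok;
}
```

Text that contains no escapes never calls check_codepoint(), so the
callback stack is untouched on that path.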

Another note is that we could use the additional scanner infrastructure
to produce more accurate error pointers for other cases where we're
whining about a bad escape sequence, or some other sub-part of a lexical
token. I think that'd likely be a good idea, since the existing cursor
placement at the start of the token isn't too helpful if e.g. you're
dealing with a very long string constant. But to keep this focused,
I only touched the behavior for Unicode escapes. The rest could be
done as a separate patch.
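
To illustrate why a cursor inside the token is more useful, the sketch
below scans a string for the first malformed escape and returns its
offset; adding that offset to the token's start position gives a cursor
that points at the bad escape itself. It assumes only the JSON-style
\uXXXX form, and find_bad_unicode_escape() is a hypothetical helper, not
a function in the scanner.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Return the offset of the first malformed \uXXXX escape in a
 * NUL-terminated string, or -1 if there is none.
 */
static int
find_bad_unicode_escape(const char *tok)
{
    for (size_t i = 0; tok[i] != '\0'; i++)
    {
        if (tok[i] != '\\')
            continue;
        if (tok[i + 1] == '\\')
        {
            i++;                /* escaped backslash, skip both bytes */
            continue;
        }
        if (tok[i + 1] != 'u')
            continue;           /* some other escape, out of scope here */

        /* Require exactly four hex digits; stop at the first bad one. */
        for (int k = 2; k < 6; k++)
        {
            if (!isxdigit((unsigned char) tok[i + k]))
                return (int) i;
        }
        i += 5;                 /* step over the validated escape */
    }
    return -1;
}
```

With a constant that is thousands of bytes long, reporting
token_start + offset instead of token_start alone saves the user from
hunting through the whole literal.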

This also mops up after commit 7f380c59 by using the new pg_wchar.c
exports is_utf16_surrogate_first() etc. everywhere they're relevant
(which is just the JSON code I was touching anyway, as it happens).
I also made a bit of an effort to ensure test coverage of all the
code touched in that patch and this one.
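
For reference, a minimal standalone sketch of what UTF-16 surrogate
helpers of this kind compute. The functions are stand-ins named after
the exports mentioned above; the bodies and the use of uint32_t are
assumptions made for illustration, not a copy of the PostgreSQL source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* High (leading) surrogates occupy U+D800..U+DBFF. */
static inline bool
is_utf16_surrogate_first(uint32_t c)
{
    return c >= 0xD800 && c <= 0xDBFF;
}

/* Low (trailing) surrogates occupy U+DC00..U+DFFF. */
static inline bool
is_utf16_surrogate_second(uint32_t c)
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

/*
 * Combine a high/low pair into one code point: each half contributes
 * 10 bits, and the result is offset by 0x10000.
 */
static inline uint32_t
surrogate_pair_to_codepoint(uint32_t first, uint32_t second)
{
    assert(is_utf16_surrogate_first(first));
    assert(is_utf16_surrogate_second(second));
    return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
}
```

For example, the JSON escape pair \ud83d\ude00 combines to U+1F600.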

regards, tom lane

Attachment Content-Type Size
unicode-escapes-with-other-server-encodings-2.patch text/x-diff 52.4 KB