Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: [PATCH] json_lex_string: don't overread on bad UTF8
Date: 2024-05-07 03:42:55
Message-ID: ZjmjPyA29dIJjmjI@paquier.xyz
Lists: pgsql-hackers

On Fri, May 03, 2024 at 07:05:38AM -0700, Jacob Champion wrote:
> On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
>> but for the general encoding conversion we have what
>> would appear to be the same behavior in report_invalid_encoding(), and
>> we go out of our way there to produce a verbose error message including
>> the invalid data.

I was looking for that a couple of days ago in the backend but could
not put my finger on it. Thanks.

> We could port something like that to src/common. IMO that'd be more
> suited for an actual conversion routine, though, as opposed to a
> parser that for the most part assumes you didn't lie about the input
> encoding and is just trying not to crash if you're wrong. Most of the
> time, the parser just copies bytes between delimiters around and it's
> up to the caller to handle encodings... the exceptions to that are the
> \uXXXX escapes and the error handling.

Hmm.  That would still leave the backpatch issue at hand, which is
kind of confusing to leave as it is.  Would it be complicated to
truncate the byte sequence in the error message, and just give up on
printing more because we cannot do better when the input byte
sequence is incomplete?  We would still have some information,
depending on the string given as input, which should hopefully be
enough.  With the location pointing at the beginning of the sequence,
even better.
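
For illustration, here is a minimal standalone sketch of the kind of
truncation I have in mind, assuming UTF-8 input on the frontend side
(the helper name is hypothetical, this is not the jsonapi.c code):
walk back from the end of the excerpt and drop a trailing multibyte
sequence that got cut short, before embedding the bytes in the error
message.

#include <stddef.h>

/*
 * Hypothetical helper: return a length <= len such that an excerpt of
 * the first "length" bytes of s does not end with a truncated UTF-8
 * sequence (full validation of the string is a separate problem).
 */
static size_t
utf8_trim_incomplete(const char *s, size_t len)
{
	size_t		i = len;
	unsigned char lead;
	size_t		expected;

	/* Walk back over up to three trailing continuation bytes. */
	while (i > 0 && ((unsigned char) s[i - 1] & 0xC0) == 0x80 && len - i < 3)
		i--;

	if (i == 0)
		return 0;				/* only continuation bytes, show nothing */

	lead = (unsigned char) s[i - 1];
	if (lead < 0x80)
		expected = 1;
	else if ((lead & 0xE0) == 0xC0)
		expected = 2;
	else if ((lead & 0xF0) == 0xE0)
		expected = 3;
	else if ((lead & 0xF8) == 0xF0)
		expected = 4;
	else
		return i - 1;			/* invalid lead byte, cut before it */

	/* Keep everything if the last sequence is complete, else cut it. */
	return (len - (i - 1) >= expected) ? len : i - 1;
}

The error report would then include only the first
utf8_trim_incomplete(token, len) bytes of the token, with the location
still pointing at the beginning of the sequence.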

> Offhand, are all of our supported frontend encodings
> self-synchronizing? By that I mean, is it safe to print a partial byte
> sequence if the locale isn't UTF-8? (As I type this I'm staring at
> Shift-JIS, and thinking "probably not.")
>
> Actually -- hopefully this is not too much of a tangent -- that
> further crystallizes a vague unease about the API that I have. The
> JsonLexContext is initialized with something called the
> "input_encoding", but that encoding is necessarily also the output
> encoding for parsed string literals and error messages. For the server
> side that's fine, but frontend clients have the input_encoding locked
> to UTF-8, which seems like it might cause problems? Maybe I'm missing
> code somewhere, but I don't see a conversion routine from
> json_errdetail() to the actual client/locale encoding. (And the parser
> does not support multibyte input_encodings that contain ASCII in trail
> bytes.)

Referring to json_lex_string(), which does UTF-8 -> ASCII -> give-up
in its conversion for FRONTEND, I guess?  Yep.  This limitation looks
like a problem, especially when plugging that into libpq.
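
To make that limitation concrete, here is a rough standalone sketch of
that decision ladder for a \uXXXX escape in a frontend build; the
function name and error code are illustrative, not the actual
jsonapi.c identifiers, and surrogate-pair handling is omitted: emit
the code point as UTF-8 if we may assume UTF-8, otherwise accept
ASCII only, otherwise give up.

typedef enum
{
	ESCAPE_OK,
	ESCAPE_HIGH_NON_UTF8		/* illustrative, not PG's error code */
} escape_result;

/*
 * Illustrative ladder for a \uXXXX code point "cp" on the frontend:
 * UTF-8 if we may assume it, else ASCII only, else fail.
 */
static escape_result
append_unicode_escape(unsigned int cp, int assume_utf8,
					  char *out, int *outlen)
{
	if (assume_utf8)
	{
		/* A single \uXXXX is at most U+FFFF, so three bytes suffice. */
		if (cp <= 0x7F)
		{
			out[0] = (char) cp;
			*outlen = 1;
		}
		else if (cp <= 0x7FF)
		{
			out[0] = (char) (0xC0 | (cp >> 6));
			out[1] = (char) (0x80 | (cp & 0x3F));
			*outlen = 2;
		}
		else
		{
			out[0] = (char) (0xE0 | (cp >> 12));
			out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
			out[2] = (char) (0x80 | (cp & 0x3F));
			*outlen = 3;
		}
		return ESCAPE_OK;
	}

	if (cp <= 0x7F)
	{
		/* The ASCII range is the same in every supported encoding. */
		out[0] = (char) cp;
		*outlen = 1;
		return ESCAPE_OK;
	}

	/* No way to represent it without knowing the real client encoding. */
	return ESCAPE_HIGH_NON_UTF8;
}

The first branch is effectively hard-wired on for frontend callers
today, which is the part that looks problematic once the parser
output ends up on a libpq connection using some other client
encoding.
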
--
Michael
