Re: jsonb, unicode escapes and escaped backslashes

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: jsonb, unicode escapes and escaped backslashes
Date: 2015-01-23 07:18:30
Message-ID: 20150123071830.GA3218944@tornado.leadboat.com
Lists: pgsql-hackers

On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:
> The following case has just been brought to my attention (look at the
> differing number of backslashes):
>
> andrew=# select jsonb '"\\u0000"';
>   jsonb
> ----------
>  "\u0000"
> (1 row)
>
> andrew=# select jsonb '"\u0000"';
>   jsonb
> ----------
>  "\u0000"
> (1 row)

A mess indeed. The input is unambiguous, but the jsonb representation can't
distinguish "\u0000" from "\\u0000". Some operations on the original json
type have similar problems, since they use an in-memory binary representation
with the same shortcoming:

[local] test=# select json_array_element_text($$["\u0000"]$$, 0) =
test-#        json_array_element_text($$["\\u0000"]$$, 0);
 ?column?
----------
 t
(1 row)
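
A minimal sketch of how such a collision can arise, assuming a lexer that
unescapes everything except \u0000. This is a toy Python model, not the
PostgreSQL code; lossy_unescape is a hypothetical name, and only the escapes
needed for the demonstration are handled:

```python
def lossy_unescape(body: str) -> str:
    """Toy model: unescape a JSON string body (the text between the quotes),
    but keep \\u0000 as the six literal characters backslash-u-0-0-0-0."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
        elif body[i + 1] == "u":
            code = int(body[i + 2:i + 6], 16)
            # The exception: NUL is assumed not storable, so keep the escape text.
            out.append("\\u0000" if code == 0 else chr(code))
            i += 6
        else:
            # Only \\ and \" are modelled; other escapes (\n, \t, ...) are omitted.
            out.append(body[i + 1])
            i += 2
    return "".join(out)


one_backslash = r"\u0000"     # the JSON escape for U+0000
two_backslashes = r"\\u0000"  # an escaped backslash followed by the text u0000

# Two distinct inputs end up as the same stored text.
print(lossy_unescape(one_backslash) == lossy_unescape(two_backslashes))
```

Once both inputs have been reduced to the same text, nothing downstream can
tell them apart.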

> Things get worse, though. On output, '\uabcd' for any four hex digits is
> recognized as a unicode escape, and thus the backslash is not escaped, so
> that we get:
>
> andrew=# select jsonb '"\\uabcd"';
>   jsonb
> ----------
>  "\uabcd"
> (1 row)
>
>
> We could probably fix this fairly easily for non- U+0000 cases by having
> jsonb_to_cstring use a different escape_json routine.

Sounds reasonable. For 9.4.1, before more people upgrade?
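
To illustrate the output side, here is a hedged sketch in Python of the two
escaping policies under discussion. escape_lenient and escape_strict are
hypothetical names, not the real escape_json, and control characters are
ignored for brevity:

```python
import re

# A backslash followed by 'u' and four hex digits.
UNICODE_ESCAPE = re.compile(r"\\u[0-9A-Fa-f]{4}")


def escape_lenient(text: str) -> str:
    """Leave anything that looks like \\uXXXX alone; escape other backslashes."""
    out = ['"']
    i = 0
    while i < len(text):
        if UNICODE_ESCAPE.match(text, i):
            out.append(text[i:i + 6])  # backslash passes through unescaped
            i += 6
            continue
        ch = text[i]
        out.append("\\" + ch if ch in '"\\' else ch)
        i += 1
    out.append('"')
    return "".join(out)


def escape_strict(text: str) -> str:
    """Always escape backslashes and double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


stored = "\\uabcd"  # six characters, as stored after reading "\\uabcd"
print(escape_lenient(stored))  # one backslash: reads back as U+ABCD
print(escape_strict(stored))   # two backslashes: reads back as the original text
```

Under this model the strict routine round-trips "\\uabcd", but it would also
double the backslash of a stored \u0000 placeholder, which is why the U+0000
case needs separate treatment.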

> But it's a mess, sadly, and I'm not sure what a good fix for the U+0000 case
> would look like.

Agreed. When a string unescape algorithm removes some kinds of backslash
escapes and not others, it's nigh inevitable that two semantically distinct
inputs can yield the same output. json_lex_string() fell into that trap by
making an exception for \u0000. To fix this, the result needs to be fully
unescaped (\u0000 converted to the NUL byte) or retain all backslash escapes.
(Changing that either way is no fun now that an on-disk format is at stake.)
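
As a rough illustration of why either consistent policy avoids the collision,
the following Python sketch uses the standard json module as a stand-in
parser. fully_unescape and retain_escapes are hypothetical names, and the
input is assumed to be a bare JSON string with no surrounding whitespace:

```python
import json


def fully_unescape(json_string: str) -> str:
    """Remove every escape; \\u0000 becomes an actual NUL character."""
    return json.loads(json_string)


def retain_escapes(json_string: str) -> str:
    """Validate the input, then keep the body exactly as written."""
    json.loads(json_string)    # raises ValueError on malformed input
    return json_string[1:-1]   # strip the surrounding quotes only


a = r'"\u0000"'
b = r'"\\u0000"'

# Under either policy the two inputs remain distinguishable.
for convert in (fully_unescape, retain_escapes):
    print(convert.__name__, convert(a) != convert(b))
```

The trade-offs differ: full unescaping requires the storage format to hold a
NUL character, while retaining escapes means equal strings can have different
spellings ("A" versus "\u0041") and must be normalized before comparison.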

> Maybe we should detect such input and emit a warning of
> ambiguity? It's likely to be rare enough, but clearly not as rare as we'd
> like, since this is a report from the field.

Perhaps. Something like "WARNING: jsonb cannot represent \\u0000; reading as
\u0000"? Unfortunate, but I do prefer that to silent data corruption.
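
A sketch of what detecting such input could look like, in Python. The
function names are hypothetical and the check is only an assumption about how
"such input" might be defined: an escaped backslash immediately followed by
the literal text u0000. The body is assumed to be already-valid JSON string
content:

```python
import re
import warnings

# One JSON escape sequence: \uXXXX, or a backslash plus one character.
ESCAPE = re.compile(r"\\(?:u[0-9A-Fa-f]{4}|.)")


def is_ambiguous(body: str) -> bool:
    """True if the body contains an escaped backslash followed by 'u0000'."""
    for match in ESCAPE.finditer(body):
        if match.group() == "\\\\" and body.startswith("u0000", match.end()):
            return True
    return False


def check(body: str) -> None:
    """Emit a warning, rather than silently collapsing the two spellings."""
    if is_ambiguous(body):
        warnings.warn(r"jsonb cannot represent \\u0000; reading as \u0000")


print(is_ambiguous(r"\\u0000"))   # escaped backslash, then the text u0000
print(is_ambiguous(r"\u0000"))    # a plain U+0000 escape
print(is_ambiguous(r"\\\u0000"))  # escaped backslash, then a U+0000 escape
```

Scanning escape by escape, rather than searching for a fixed substring, keeps
the third case from being misreported.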

Thanks,
nm
