Quick Links

JSON and unicode surrogate pairs

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	JSON and unicode surrogate pairs
Date:	2013-06-05 14:46:15
Message-ID:	51AF4F37.107@dunslane.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

In 9.2, the JSON parser didn't check the validity of the use of unicode
escapes other than that it required 4 hex digits to follow '\u'. In 9.3,
that is still the case. However, the JSON accessor functions and
operators also try to turn JSON strings into text in the server
encoding, and this includes de-escaping \u sequences. This works fine
except when there is a pair of sequences representing a UTF-16 type
surrogate pair, something that is explicitly permitted in the JSON spec.

The attached patch is an attempt to remedy that, and a surrogate pair is
turned into the correct code point before converting it to whatever the
server encoding is.

Note that this would mean we can still put JSON with incorrect use of
surrogates into the database, as now (9.2 and later), and they will
cause almost all the accessor functions to raise an error, as now (9.3).
All this does is allow JSON that uses surrogates correctly not to fail
when applying the accessor functions and operators. That's a possible
violation of POLA, and at least worth of a note in the docs, but I'm not
sure what else we can do now - adding this check to the input lexer
would possibly cause restores to fail, which users might not thank us for.

cheers

andrew

Attachment	Content-Type	Size
json-surrogate.patch	text/x-patch	2.9 KB

Responses

Re: JSON and unicode surrogate pairs at 2013-06-06 16:53:39 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2013-06-05 15:01:44	Re: extensible external toast tuple support & snappy prototype
Previous Message	Hannu Krosing	2013-06-05 14:46:11	Re: MVCC catalog access