Re: JSON for PG 9.2

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Claes Jakobsson <claes(at)surfar(dot)nu>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Jan Urbański <wulczer(at)wulczer(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Jan Wieck <janwieck(at)yahoo(dot)com>
Subject: Re: JSON for PG 9.2
Date: 2012-01-31 17:04:31
Message-ID: CA+TgmoYg_SdB70gxx2vFW3z+oB8K7aU8XnQwp+sB0_H7c2FehQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 23, 2012 at 3:20 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On Sun, 2012-01-22 at 11:43 -0500, Andrew Dunstan wrote:
>> Actually, given recent discussion I think that test should just be
>> removed from json.c. We don't actually have any test that the code
>> point is valid (e.g. that it doesn't refer to an unallocated code
>> point). We don't do that elsewhere either - the unicode_to_utf8()
>> function the scanner uses to turn \unnnn escapes into utf8 doesn't
>> look for unallocated code points. I'm not sure how much other
>> validation we should do - for example on correct use of surrogate
>> pairs.
>
> We do check the correctness of surrogate pairs elsewhere.  Search for
> "surrogate" in scan.l; should be easy to copy.

I've committed a version of this that does NOT do surrogate pair
validation. Per discussion elsewhere, I also removed the check for
\uXXXX with XXXX > 007F and database encoding != UTF8. This will
complicate things somewhat when we get around to doing
canonicalization and comparison, but Tom seems confident that those
issues are manageable. I did not commit Andrew's further changes,
either; I'm assuming he'll do that himself.

With respect to the issue of whether we ought to check surrogate
pairs, the JSON spec is not a whole lot of help. RFC4627 says:

    To escape an extended character that is not in the Basic Multilingual
    Plane, the character is represented as a twelve-character sequence,
    encoding the UTF-16 surrogate pair. So, for example, a string
    containing only the G clef character (U+1D11E) may be represented as
    "\uD834\uDD1E".
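
For reference, the arithmetic behind the RFC's example can be sketched in a few lines of Python. This is only an illustration of the standard UTF-16 surrogate calculation, not the code in json.c or scan.l; the function name combine_surrogates is made up for this example.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point.

    Hypothetical helper for illustration only; it raises ValueError
    when either half falls outside its surrogate range.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first unit is not a high surrogate: %04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second unit is not a low surrogate: %04X" % low)
    # Each half carries 10 bits; the result is offset past the BMP.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The G clef example from the RFC: \uD834\uDD1E encodes U+1D11E.
print("U+%04X" % combine_surrogates(0xD834, 0xDD1E))
```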

That fails to answer the question of what we ought to do if we get an
invalid sequence there. You could make an argument that we ought to
just allow it; it doesn't particularly hinder our ability to
canonicalize or compare strings, because our notion of sort-ordering
for characters that may span multiple encodings is going to be pretty
funky anyway. We can just leave those bits as \uXXXX sequences and
call it good. However, it would hinder our ability to convert a JSON
string to a string in the database encoding: we could find an
invalid surrogate pair that was allowable as JSON but
unrepresentable in the database encoding. On the flip side, given our
decision to allow all \uXXXX sequences even when not using UTF-8, we
could also run across a perfectly valid UTF-8 sequence that's not
representable as a character in the server encoding, so it seems we
have that problem anyway, so maybe it's not much worse to have two
reasons why it can happen rather than one. On the third hand, most
people are probably using UTF-8, and those people aren't going to have
any transcoding issues, so the invalid surrogate pair case may be the
only one they can hit (unless invalid code points are also an issue?),
so maybe it's worth avoiding on that basis.
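
To make the two failure cases above concrete, here is a rough Python sketch. It is not what was committed and not what scan.l does. It assumes the input is the body of a JSON string that has already passed syntax checking (every \u is followed by four hex digits), and it uses Python codec names as stand-ins for server encodings. The names unpaired_surrogates and representable are hypothetical.

```python
def unpaired_surrogates(body: str) -> list:
    """Return offsets of \\uXXXX escapes that are unpaired surrogates.

    Assumes `body` is the inside of an already syntax-checked JSON string.
    """
    bad = []
    pending_high = None  # offset of a high surrogate awaiting its low half
    i = 0
    while i < len(body):
        unit = None
        step = 1
        if body[i] == "\\" and i + 1 < len(body):
            if body[i + 1] == "u":
                unit = int(body[i + 2:i + 6], 16)
                step = 6
            else:
                step = 2  # some other escape, e.g. an escaped backslash
        is_low = unit is not None and 0xDC00 <= unit <= 0xDFFF
        if pending_high is not None and not is_low:
            bad.append(pending_high)  # high surrogate with no low half
        if is_low and pending_high is None:
            bad.append(i)  # low surrogate with no preceding high half
        if unit is not None and 0xD800 <= unit <= 0xDBFF:
            pending_high = i
        else:
            pending_high = None
        i += step
    if pending_high is not None:
        bad.append(pending_high)
    return bad


def representable(code_point: int, encoding: str) -> bool:
    """True if the code point can be encoded in the given Python codec."""
    try:
        chr(code_point).encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

Under these assumptions, a lone \uD834 is flagged by the first function, while a well-formed escape such as \u20AC passes it yet is still not representable in a single-byte encoding like Latin-1. Those are the two separate reasons a conversion to the database encoding could fail.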

Anyway, I defer to the wisdom of the collective on this one: how
should we handle this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
