Re: JSON and unicode surrogate pairs

From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: JSON and unicode surrogate pairs
Date: 2013-06-11 10:53:56
Message-ID: 51B701C4.1060302@2ndQuadrant.com
Lists: pgsql-hackers

On 06/11/2013 10:47 AM, Andres Freund wrote:
> On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote:
>>> It's legal, is it not, to just write the equivalent Unicode character in
>>> the JSON string and not use the escapes? If so I would think that that
>>> would be the most common usage. If someone's writing an escape, they
>>> probably had a reason for doing it that way, and might not appreciate
>>> our overriding their decision.
>> We never store the converted values in the JSON object, nor do we return
>> them from functions that return JSON. But many of the functions and
>> operators that process the JSON have variants that return text instead of
>> JSON, and in those cases, when the value returned is a JSON string, we do
>> the following to it:
>>
>> I have just realized that the problem is actually quite a lot bigger than
>> that. We also use this value for field name comparison. So, let us suppose
>> that we have a LATIN1 database and a piece of JSON with a field name
>> containing the Euro sign ("\u20ac"), a character that is not in LATIN1.
>> Making that processable so it doesn't blow up would be mighty tricky and
>> error prone. The non-orthogonality I suggested as a solution upthread is, by
>> contrast, very small and easy to manage, and not terribly hard to explain -
>> see attached.
> I think this all shows pretty clearly that it was a mistake allowing
> json data in the database that we cannot entirely display with the
> database's encoding. All the proposed ugly workarounds are only
> necessary because we don't throw an error when originally validating the
> json.
> Even in a utf-8 database you can get errors due to \u unescaping (at
> attribute access time, *NOT* at json_in() time) due to invalid
> surrogate pairs.
>
> I think this goes contrary to the normal postgres approach of validating
> data as strictly as necessary. And I think we are going to regret not
> fixing this while there are still relatively few users out there.
Exactly -

* Allow in only valid JSON.
* Validate all utf8 strings for valid unicode.
* Have one canonical way of outputting unicode - utf8 for utf8
databases, escaped for all other encodings.
* If you need to store anything else, use text.
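A rough illustration of these rules, sketched with Python's json module as a stand-in parser (PostgreSQL's own json type behaves differently today, which is the point of the thread):

```python
import json

# A valid surrogate pair in a \u escape decodes to a single code point.
assert json.loads('"\\ud83d\\ude00"') == "\U0001F600"

# A lone surrogate is syntactically well-formed escape text, so a lax
# parser accepts it -- but the resulting string cannot be encoded as
# UTF-8. Strict input validation would reject it up front instead.
lone = json.loads('"\\ud83d"')
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate fails at encode time, not at parse time")

# One canonical output form per encoding: raw UTF-8 where the database
# encoding allows it, \u-escaped ASCII everywhere else.
value = {"price": "\u20ac 5"}                 # Euro sign
print(json.dumps(value, ensure_ascii=False))  # {"price": "€ 5"}
print(json.dumps(value, ensure_ascii=True))   # {"price": "\u20ac 5"}
```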

Requiring that the "original text" be preserved in a json data field is
Not Good!

I fully expect '{"a":1, "a":null, "a":true, "a":"b"}'::json to come out
as '{"a":"b"}'

(I know that currently this is not true and the collapse will happen
only once I read the json value in the client.)
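The last-key-wins collapse expected above is what most client-side parsers already do; Python's json module is one such (a client-side sketch, not what PostgreSQL's json type does today):

```python
import json

doc = '{"a": 1, "a": null, "a": true, "a": "b"}'
# Last key wins: only the final binding of "a" survives the parse.
print(json.loads(doc))  # {'a': 'b'}
```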

For anything else - don't use json; use any text type.

If you really need a simple text-validated-as-valid-JSON-input datatype,
then add one.

Call it jsontext or somesuch :)

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
