On 01/20/2012 09:19 AM, Robert Haas wrote:
> On Fri, Jan 20, 2012 at 12:07 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Andrew Dunstan<andrew(at)dunslane(dot)net> writes:
>>> On 01/19/2012 04:12 PM, Robert Haas wrote:
>>>> On Thu, Jan 19, 2012 at 4:07 PM, Andrew Dunstan<andrew(at)dunslane(dot)net> wrote:
>>>>> The spec only allows unescaped Unicode chars (and for our purposes that
>>>>> means UTF8). An unescaped non-ASCII character in, say, ISO-8859-1 will
>>>>> result in something that's not legal JSON.
>>>> I understand. I'm proposing that we not care. In other words, if the
>>>> server encoding is UTF-8, it'll really be JSON. But if the server
>>>> encoding is something else, it'll be almost-JSON.
>>> Of course, for data going to the client, if the client encoding is UTF8,
>>> they should get legal JSON, regardless of what the database encoding is,
>>> and conversely too, no?
>> Yes. I think this argument has been mostly theologizing, along the
>> lines of how many JSON characters can dance on the head of a pin.
>> From a user's perspective, the database encoding is only a constraint on
>> which characters he can store.
>> He does not know or care what the bit
>> representation is inside the server. As such, if we store a non-ASCII
>> character in a JSON string, it's valid JSON as far as the user is
>> concerned, so long as that character exists in the Unicode standard.
>> If his client encoding is UTF8, the value will be letter-perfect JSON
>> when it gets to him; and if his client encoding is not UTF8, then he's
>> already pretty much decided that he doesn't give a fig about the
>> Unicode-centricity of the JSON spec, no?
> Also agreed. Personally, I think it may not have been a great idea to
> tie the JSON spec so closely to Unicode, but I understand that it
> would have been difficult to define an encoding-agnostic equivalent of
> \uXXXX, since it's hard to know for sure whether an arbitrary encoding
> even has a (sensible?) definition of code points, and they probably
> wanted to avoid ambiguity. But, it's bound to cause problems for any
> system that runs in some other encoding, which, when so requested, we
> do. Even if we had the ability to support multiple encodings in the
> same database, I'm not sure I'd be very excited about insisting that
> JSON data always be stored in UTF-8, because that would introduce a
> lot of unnecessary transcoding for people using other encodings and
> basically unnecessarily handicap the functionality provided by the
> datatype. But at least if we had that, people would have the *option*
> to use JSON with UTF-8 and get the fully spec-compliant behavior. As
> it is, they don't; the system we have forces the database encoding on
> all datatypes whether they like it or not, and that ain't changing for
>> So I'm with Robert: we should just plain not care. I would further
>> suggest that maybe what we should do with incoming JSON escape sequences
>> is convert them to Unicode code points and then to the equivalent
>> character in the database encoding (or throw error if there is none).
> The code I've written so far does no canonicalization of the input
> value of any kind, just as we do for XML. I'm inclined to leave it
> that way. Eventually, we might want to make the JSON datatype support
> equality comparisons and so on, and that will require the system to
> knowing that the letter r can be encoded as some \uXXXX sequence and
> that the escape \r is equivalent to some other escape \uXXXX, but
> right now all the code does is try to validate that the JSON is legal,
> NOT second-guess the user's choice about how to spell things or where
> to insert whitespace. I think that's a good choice because (1) AFAIK,
> there's no official canonicalization method for JSON, so whatever we
> pick will be something we think is best, not an official method
> sanction by the spec, (2) users might prefer the way they chose to
> represent a given value over the way we choose to represent it, and
> (3) by simply validating and storing the JSON object, rather than
> doing any canonicalization, the input function avoids the need to do
> any data copying, hopefully maximizing speed. Canonicalization can be
> added on top of what I've done here and people who want or need it can
> use it; I have some ideas around how to make that leverage the
> existing code that I intend to pursue for 9.3, but right now I'd
> rather not go there.
> So, given that framework, what the patch does is this: if you're using
> UTF-8, then \uXXXX is accepted, provided that XXXX is something that
> equates to a legal Unicode code point. It isn't converted to the
> corresponding character: it's just validated. If you're NOT using
> UTF-8, then it allows \uXXXX for code points up through 127 (which we
> assume are the same in all encodings) and anything higher than that is
> rejected. If someone knows an easy way to check whether a \uXXXX
> sequence for XXXX> 007F is a legal Unicode code point that has an
> equivalent in the current server encoding, then we can add logic to
> allow that case also, but personally I'm not that excited about it.
> Anyone who is using \uXXXX escapes with a non-Unicode coding is
> probably hoping that we can store arbitrary code points, not just the
> ones that happen to exist in the server encoding, so they're probably
> going to be disappointed whether or not we spend time eating away at
> the edges of the problem.
> It's awfully interesting to think about how we could actually make
> cross-encoding stuff work for real. Would we make it a separate
> property that can be associated with each column, like we did for
> collations? Or would we handle it more like range types - give me an
> encoding, and I'll give you a new type OID that represents text stored
> in that encoding? But I guess that's a question for another day.
(thinks out loud)
XML's &#nnnn; escape mechanism is more or less the equivalent of JSON's
\unnnn. But XML documents can be encoded in a variety of encodings,
including non-unicode encodings such as Latin-1. However, no matter what
the document encoding, &#nnnn; designates the character with Unicode
code point nnnn, whether or not that is part of the document encoding's
Given that precedent, I'm wondering if we do need to enforce anything
other than that it is a valid unicode code point.
Equivalence comparison is going to be difficult anyway if you're not
resolving all \unnnn escapes. Possibly we need some sort of
canonicalization function to apply for comparison purposes. But we're
not providing any comparison ops today anyway, so I don't think we need
to make that decision now. As you say, there doesn't seem to be any
defined canonical form - the spec is a bit light on in this respect.
In response to
pgsql-hackers by date
|Next:||From: Dimitri Fontaine||Date: 2012-01-20 16:29:16|
|Subject: Re: Inline Extension|
|Previous:||From: Robert Haas||Date: 2012-01-20 15:44:28|
|Subject: Re: CLOG contention, part 2|