On Fri, Jan 20, 2012 at 12:07 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> On 01/19/2012 04:12 PM, Robert Haas wrote:
>>> On Thu, Jan 19, 2012 at 4:07 PM, Andrew Dunstan<andrew(at)dunslane(dot)net> wrote:
>>>> The spec only allows unescaped Unicode chars (and for our purposes that
>>>> means UTF8). An unescaped non-ASCII character in, say, ISO-8859-1 will
>>>> result in something that's not legal JSON.
>>> I understand. I'm proposing that we not care. In other words, if the
>>> server encoding is UTF-8, it'll really be JSON. But if the server
>>> encoding is something else, it'll be almost-JSON.
>> Of course, for data going to the client, if the client encoding is UTF8,
>> they should get legal JSON, regardless of what the database encoding is,
>> and conversely too, no?
> Yes. I think this argument has been mostly theologizing, along the
> lines of how many JSON characters can dance on the head of a pin.
> From a user's perspective, the database encoding is only a constraint on
> which characters he can store.
> He does not know or care what the bit
> representation is inside the server. As such, if we store a non-ASCII
> character in a JSON string, it's valid JSON as far as the user is
> concerned, so long as that character exists in the Unicode standard.
> If his client encoding is UTF8, the value will be letter-perfect JSON
> when it gets to him; and if his client encoding is not UTF8, then he's
> already pretty much decided that he doesn't give a fig about the
> Unicode-centricity of the JSON spec, no?
Also agreed. Personally, I think it may not have been a great idea to
tie the JSON spec so closely to Unicode, but I understand that it
would have been difficult to define an encoding-agnostic equivalent of
\uXXXX, since it's hard to know for sure whether an arbitrary encoding
even has a (sensible?) definition of code points, and they probably
wanted to avoid ambiguity. But, it's bound to cause problems for any
system that runs in some other encoding, which, when so requested, we
do. Even if we had the ability to support multiple encodings in the
same database, I'm not sure I'd be very excited about insisting that
JSON data always be stored in UTF-8, because that would introduce a
lot of unnecessary transcoding for people using other encodings and
basically unnecessarily handicap the functionality provided by the
datatype. But at least if we had that, people would have the *option*
to use JSON with UTF-8 and get the fully spec-compliant behavior. As
it is, they don't; the system we have forces the database encoding on
all datatypes whether they like it or not, and that ain't changing for
> So I'm with Robert: we should just plain not care. I would further
> suggest that maybe what we should do with incoming JSON escape sequences
> is convert them to Unicode code points and then to the equivalent
> character in the database encoding (or throw error if there is none).
The code I've written so far does no canonicalization of the input
value of any kind, just as we do for XML. I'm inclined to leave it
that way. Eventually, we might want to make the JSON datatype support
equality comparisons and so on, and that will require the system to
knowing that the letter r can be encoded as some \uXXXX sequence and
that the escape \r is equivalent to some other escape \uXXXX, but
right now all the code does is try to validate that the JSON is legal,
NOT second-guess the user's choice about how to spell things or where
to insert whitespace. I think that's a good choice because (1) AFAIK,
there's no official canonicalization method for JSON, so whatever we
pick will be something we think is best, not an official method
sanction by the spec, (2) users might prefer the way they chose to
represent a given value over the way we choose to represent it, and
(3) by simply validating and storing the JSON object, rather than
doing any canonicalization, the input function avoids the need to do
any data copying, hopefully maximizing speed. Canonicalization can be
added on top of what I've done here and people who want or need it can
use it; I have some ideas around how to make that leverage the
existing code that I intend to pursue for 9.3, but right now I'd
rather not go there.
So, given that framework, what the patch does is this: if you're using
UTF-8, then \uXXXX is accepted, provided that XXXX is something that
equates to a legal Unicode code point. It isn't converted to the
corresponding character: it's just validated. If you're NOT using
UTF-8, then it allows \uXXXX for code points up through 127 (which we
assume are the same in all encodings) and anything higher than that is
rejected. If someone knows an easy way to check whether a \uXXXX
sequence for XXXX > 007F is a legal Unicode code point that has an
equivalent in the current server encoding, then we can add logic to
allow that case also, but personally I'm not that excited about it.
Anyone who is using \uXXXX escapes with a non-Unicode coding is
probably hoping that we can store arbitrary code points, not just the
ones that happen to exist in the server encoding, so they're probably
going to be disappointed whether or not we spend time eating away at
the edges of the problem.
It's awfully interesting to think about how we could actually make
cross-encoding stuff work for real. Would we make it a separate
property that can be associated with each column, like we did for
collations? Or would we handle it more like range types - give me an
encoding, and I'll give you a new type OID that represents text stored
in that encoding? But I guess that's a question for another day.
The Enterprise PostgreSQL Company
In response to
pgsql-hackers by date
|Next:||From: Dimitri Fontaine||Date: 2012-01-20 14:28:40|
|Subject: Re: Command Triggers|
|Previous:||From: Heikki Linnakangas||Date: 2012-01-20 14:11:48|
|Subject: Re: Scaling XLog insertion (was Re: Moving more work outside