
Re: JSON for PG 9.2

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Claes Jakobsson <claes(at)surfar(dot)nu>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Jan Urbański <wulczer(at)wulczer(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Jan Wieck <janwieck(at)yahoo(dot)com>
Subject: Re: JSON for PG 9.2
Date: 2012-01-20 17:34:44
Message-ID: 4F19A5B4.2020902@dunslane.net
Lists: pgsql-hackers

On 01/20/2012 11:58 AM, Robert Haas wrote:
> On Fri, Jan 20, 2012 at 10:45 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>> XML's &#nnnn; escape mechanism is more or less the equivalent of JSON's
>> \unnnn. But XML documents can be encoded in a variety of encodings,
>> including non-unicode encodings such as Latin-1. However, no matter what
>> the document encoding, &#nnnn; designates the character with Unicode code
>> point nnnn, whether or not that is part of the document encoding's charset.
> OK.
>
>> Given that precedent, I'm wondering if we do need to enforce anything other
>> than that it is a valid unicode code point.
>>
>> Equivalence comparison is going to be difficult anyway if you're not
>> resolving all \unnnn escapes. Possibly we need some sort of canonicalization
>> function to apply for comparison purposes. But we're not providing any
>> comparison ops today anyway, so I don't think we need to make that decision
>> now. As you say, there doesn't seem to be any defined canonical form - the
>> spec is a bit light in this respect.
> Well, we clearly have to resolve all \uXXXX to do either comparison or
> canonicalization.  The current patch does neither, but presumably we
> want to leave the door open to such things.  If we're using UTF-8 and
> comparing two strings, and we get to a position where one of them has
> a character and the other has \uXXXX, it's pretty simple to do the
> comparison: we just turn XXXX into a wchar_t and test for equality.
> That should be trivial, unless I'm misunderstanding.  If, however,
> we're not using UTF-8, we have to first turn \uXXXX into a Unicode
> code point, then convert that to a character in the database encoding,
> and then test for equality with the other character after that.  I'm
> not sure whether that's possible in general, how to do it, or how
> efficient it is.  Can you or anyone shed any light on that topic?


We know perfectly well how to turn two strings from encoding x to utf8 
(see mbutils.c::pg_do_encoding_conversion()). Once we've done that, 
ISTM we have reduced this to the previous problem, as the mathematicians 
like to say.
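
To make the reduction concrete, here's a minimal (untested) sketch of 
what I mean. compare_json_strings() and resolve_unicode_escape() are 
made-up names, but pg_do_encoding_conversion(), unicode_to_utf8() and 
pg_utf_mblen() are the existing backend routines:

    #include "postgres.h"
    #include "mb/pg_wchar.h"

    /*
     * Hypothetical comparison helper: convert both (NUL-terminated)
     * inputs to UTF-8 so the rest of the logic only ever deals with
     * one encoding.
     */
    static int
    compare_json_strings(unsigned char *a, int alen,
                         unsigned char *b, int blen)
    {
        int         src_encoding = GetDatabaseEncoding();
        unsigned char *ua;
        unsigned char *ub;

        /* Reduce to the previous problem: everything becomes UTF-8. */
        ua = pg_do_encoding_conversion(a, alen, src_encoding, PG_UTF8);
        ub = pg_do_encoding_conversion(b, blen, src_encoding, PG_UTF8);

        /*
         * Now walk both strings; wherever one side has \uXXXX and the
         * other has a literal character, expand the escape to UTF-8
         * bytes and compare byte-wise (elided here).
         */
        return strcmp((char *) ua, (char *) ub);    /* placeholder */
    }

    /*
     * Expand one \uXXXX escape, with the hex digits already parsed
     * into a code point, into its UTF-8 byte sequence in buf (which
     * must have room for at least 4 bytes); returns its length.
     */
    static int
    resolve_unicode_escape(pg_wchar codepoint, unsigned char *buf)
    {
        unicode_to_utf8(codepoint, buf);
        return pg_utf_mblen(buf);
    }

Error handling, palloc'd-result ownership and the actual walk over the 
two strings are omitted, but that's the shape of it.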


cheers

andrew

