Re: JSON for PG 9.2

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Claes Jakobsson <claes(at)surfar(dot)nu>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Jan Urbański <wulczer(at)wulczer(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Jan Wieck <janwieck(at)yahoo(dot)com>
Subject: Re: JSON for PG 9.2
Date: 2012-01-20 16:58:07
Message-ID: CA+TgmoZksnjJTN4ejqPXOvZE5hWDEfj5AqTH=yzZYz4PhczL9Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 20, 2012 at 10:45 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> XML's &#nnnn; escape mechanism is more or less the equivalent of JSON's
> \unnnn. But XML documents can be encoded in a variety of encodings,
> including non-unicode encodings such as Latin-1. However, no matter what the
> document encoding, &#nnnn; designates the character with Unicode code point
> nnnn, whether or not that is part of the document encoding's charset.

OK.

> Given that precedent, I'm wondering if we do need to enforce anything other
> than that it is a valid unicode code point.
>
> Equivalence comparison is going to be difficult anyway if you're not
> resolving all \unnnn escapes. Possibly we need some sort of canonicalization
> function to apply for comparison purposes. But we're not providing any
> comparison ops today anyway, so I don't think we need to make that decision
> now. As you say, there doesn't seem to be any defined canonical form - the
> spec is a bit light on in this respect.

Well, we clearly have to resolve all \uXXXX to do either comparison or
canonicalization. The current patch does neither, but presumably we
want to leave the door open to such things. If we're using UTF-8 and
comparing two strings, and we get to a position where one of them has
a character and the other has \uXXXX, it's pretty simple to do the
comparison: we just turn XXXX into a wchar_t and test for equality.
That should be trivial, unless I'm misunderstanding. If, however,
we're not using UTF-8, we have to first turn \uXXXX into a Unicode
code point, then convert that to a character in the database encoding,
and then test for equality with the other character after that. I'm
not sure whether that's possible in general, how to do it, or how
efficient it is. Can you or anyone shed any light on that topic?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
