On Fri, Jan 20, 2012 at 10:45 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> XML's &#nnnn; escape mechanism is more or less the equivalent of JSON's
> \unnnn. But XML documents can be encoded in a variety of encodings,
> including non-unicode encodings such as Latin-1. However, no matter what the
> document encoding, &#nnnn; designates the character with Unicode code point
> nnnn, whether or not that is part of the document encoding's charset.
> Given that precedent, I'm wondering if we do need to enforce anything other
> than that it is a valid unicode code point.
> Equivalence comparison is going to be difficult anyway if you're not
> resolving all \unnnn escapes. Possibly we need some sort of canonicalization
> function to apply for comparison purposes. But we're not providing any
> comparison ops today anyway, so I don't think we need to make that decision
> now. As you say, there doesn't seem to be any defined canonical form - the
> spec is a bit light on in this respect.
Well, we clearly have to resolve all \uXXXX to do either comparison or
canonicalization. The current patch does neither, but presumably we
want to leave the door open to such things. If we're using UTF-8 and
comparing two strings, and we get to a position where one of them has
a character and the other has \uXXXX, it's pretty simple to do the
comparison: we just turn XXXX into a wchar_t and test for equality.
That should be trivial, unless I'm misunderstanding. If, however,
we're not using UTF-8, we have to first turn \uXXXX into a Unicode
code point, then convert that to a character in the database encoding,
and then test for equality with the other character after that. I'm
not sure whether that's possible in general, how to do it, or how
efficient it is. Can you or anyone shed any light on that topic?
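
To make the two cases above concrete, here is a rough sketch in Python rather than backend C, so it is an illustration of the idea and not what the patch does. The function names (resolve_unicode_escapes, to_database_encoding, json_strings_equal) are made up for this example, and Python's codec names stand in for the database encoding. It resolves only \uXXXX escapes (joining surrogate pairs) and leaves every other escape untouched; a code point with no mapping in the target encoding is reported as None.

```python
import re

# A backslash followed by either uXXXX or any other single character.
# Consuming "\\" as a unit keeps an escaped backslash from being misread
# as the start of a \uXXXX escape.
_ESCAPE = re.compile(r"\\(?:u([0-9a-fA-F]{4})|.)", re.DOTALL)


def resolve_unicode_escapes(body: str) -> str:
    """Replace each \\uXXXX escape in a JSON string body with its character.

    Other escapes (such as \\n or \\\\) are deliberately left as they are.
    Raises UnicodeDecodeError if an escape names an unpaired surrogate.
    """
    def replace(match):
        if match.group(1) is None:
            return match.group(0)  # not a \u escape: keep it unchanged
        return chr(int(match.group(1), 16))

    resolved = _ESCAPE.sub(replace, body)
    # A round trip through UTF-16 joins \uD83D\uDE00-style surrogate pairs
    # into a single code point.
    return resolved.encode("utf-16", "surrogatepass").decode("utf-16")


def to_database_encoding(text: str, encoding: str):
    """Encode text in the given encoding, or return None if some
    character has no representation there."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


def json_strings_equal(a: str, b: str, encoding: str = "utf-8") -> bool:
    """Compare two JSON string bodies after resolving \\uXXXX escapes."""
    encoded_a = to_database_encoding(resolve_unicode_escapes(a), encoding)
    encoded_b = to_database_encoding(resolve_unicode_escapes(b), encoding)
    if encoded_a is None or encoded_b is None:
        raise ValueError("character not representable in " + encoding)
    return encoded_a == encoded_b
```

With UTF-8 the encoding step can never fail for a valid code point, which is why that case looks trivial. With something like Latin-1, an escape such as \u20ac resolves to a code point that has no character in the encoding, and the caller has to decide what that means for comparison.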