From: Joey Adams <joeyadams3.14159@gmail.com>
To: Andrew Dunstan <andrew@dunslane.net>
Cc: Pavel Stehule <pavel.stehule@gmail.com>, Robert Haas <robertmhaas@gmail.com>, "David E. Wheeler" <david@kineticode.com>, Claes Jakobsson <claes@surfar.nu>, Dimitri Fontaine <dimitri@2ndquadrant.fr>, Merlin Moncure <mmoncure@gmail.com>, Magnus Hagander <magnus@hagander.net>, Jan Urbański <wulczer@wulczer.org>, Simon Riggs <simon@2ndquadrant.com>, Bruce Momjian <bruce@momjian.us>, PostgreSQL-development Hackers <pgsql-hackers@postgresql.org>, Jan Wieck <janwieck@yahoo.com>
Subject: Re: JSON for PG 9.2
Date: 2012-01-14 23:11:57
Message-ID: CAARyMpDS_4xcwWPH3XXcxBbOqEmGyc9YCkCXcH9q=pka1PQZYg@mail.gmail.com
Lists: pgsql-hackers
On Sat, Jan 14, 2012 at 3:06 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Second, what should we do when the database encoding isn't UTF8? I'm
> inclined to emit a \unnnn escape for any non-ASCII character (assuming it
> has a unicode code point - are there any code points in the non-unicode
> encodings that don't have unicode equivalents?). The alternative would be to
> fail on non-ASCII characters, which might be ugly. Of course, anyone wanting
> to deal with JSON should be using UTF8 anyway, but we still have to deal
> with these things. What about SQL_ASCII? If there's a non-ASCII sequence
> there we really have no way of telling what it should be. There at least I
> think we should probably error out.
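[For illustration, a minimal sketch of the \unnnn escaping Andrew describes; the function name and signature are mine, not from any patch. Code points beyond the BMP are written as a UTF-16 surrogate pair, as RFC 4627 requires.]

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/*
 * Sketch (hypothetical helper): emit a JSON \uXXXX escape for one
 * Unicode code point.  Returns the number of characters written,
 * following snprintf conventions.
 */
static int
escape_unicode(char *out, size_t outlen, unsigned long cp)
{
    if (cp <= 0xFFFF)
        return snprintf(out, outlen, "\\u%04lx", cp);

    /* Non-BMP code point: encode as a UTF-16 surrogate pair. */
    cp -= 0x10000;
    return snprintf(out, outlen, "\\u%04lx\\u%04lx",
                    0xD800 + (cp >> 10),    /* high surrogate */
                    0xDC00 + (cp & 0x3FF)); /* low surrogate  */
}
```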
I don't think there is a satisfying solution to this problem. Things
working against us:
* Some server encodings support characters that don't map to Unicode
characters (e.g. unused slots in Windows-1252). Thus, converting to
UTF-8 and back is lossy in general.
* We want a normalized representation for comparison. This will
involve a mixture of server and Unicode characters, unless the
encoding is UTF-8.
* We can't efficiently convert individual characters to and from
Unicode with the current API.
* What do we do about \u0000? TEXT datums cannot contain NUL characters.
I'd say just ban Unicode escapes and non-ASCII characters unless the
server encoding is UTF-8, and ban all \u0000 escapes. It's easy, and
whatever we support later will be a superset of this.
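[A sketch of the \u0000 ban, as a standalone helper; a real check would live inside the JSON lexer, and this name and shape are mine, not from any patch. The only subtlety is verifying that the backslash starting "\u0000" is not itself escaped.]

```c
#include <stdbool.h>
#include <string.h>
#include <assert.h>

/*
 * Sketch (hypothetical helper): report whether a JSON text contains a
 * \u0000 escape, i.e. an unescaped backslash followed by "u0000".
 */
static bool
json_has_nul_escape(const char *json)
{
    const char *p = json;

    while ((p = strstr(p, "\\u0000")) != NULL)
    {
        /* Count the backslashes immediately before p: an even count
         * means p's backslash really begins an escape sequence. */
        int         nbackslash = 0;
        const char *q = p;

        while (q > json && q[-1] == '\\')
        {
            nbackslash++;
            q--;
        }
        if (nbackslash % 2 == 0)
            return true;
        p += 6;     /* literal "\\u0000": no escape here, keep scanning */
    }
    return false;
}
```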
Strategies for handling this situation have been discussed in prior
emails. This is where things got stuck last time.
- Joey