From: | Martijn van Oosterhout <kleptog(at)svana(dot)org> |
---|---|
To: | Tatsuo Ishii <ishii(at)postgresql(dot)org> |
Cc: | pg(at)heroku(dot)com, robertmhaas(at)gmail(dot)com, pavel(dot)stehule(at)gmail(dot)com, peter_e(at)gmx(dot)net, arul(at)fast(dot)au(dot)fujitsu(dot)com, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Proposal - Support for National Characters functionality |
Date: | 2013-07-16 21:07:27 |
Message-ID: | 20130716210727.GD28628@svana.org |
Lists: | pgsql-hackers |
On Mon, Jul 15, 2013 at 05:11:40PM +0900, Tatsuo Ishii wrote:
> > Does support for alternative multi-byte encodings have something to do
> > with the Han unification controversy? I don't know terribly much about
> > this, so apologies if that's just wrong.
>
> There's a famous problem regarding conversion between Unicode and other
> encodings, such as Shift Jis.
>
> There are lots of discussion on this. Here is the one from Microsoft:
>
> http://support.microsoft.com/kb/170559/EN-US
Apart from Shift-JIS not being well defined (it's more a family of
encodings), it has the unusual feature of providing multiple ways to
encode the same character. This is not even a Han unification issue;
those have largely been addressed. For example, the square-root symbol
exists twice (0x8795 and 0x81E3), as do many other mathematical
symbols.
Here's the code page which you can browse online:
http://msdn.microsoft.com/en-us/goglobal/cc305152
This means that, to be round-trippable, Unicode would have to double
those characters, but that would make it hard or impossible to
round-trip with any other character set that has those characters. No
easy solution here.
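
To make the problem concrete, here is a small sketch using Python's
built-in cp932 codec (Microsoft's Shift-JIS variant, chosen only for
illustration). The constant and function names are my own. Both byte
sequences decode to the same code point, so they cannot both survive a
decode/encode round trip.

```python
# Two cp932 byte sequences that both denote the square-root sign.
JIS_SQRT = b"\x81\xe3"  # position in the JIS X 0208 part of the code page
NEC_SQRT = b"\x87\x95"  # duplicate in the NEC extension row


def round_trip(raw: bytes, encoding: str = "cp932") -> bytes:
    """Decode to Unicode and encode straight back."""
    return raw.decode(encoding).encode(encoding)


a = JIS_SQRT.decode("cp932")
b = NEC_SQRT.decode("cp932")

# Both decode to the single Unicode code point U+221A ...
print(hex(ord(a)), hex(ord(b)), a == b)

# ... so the encoder has to pick one byte sequence, and the other is lost.
print(round_trip(JIS_SQRT) == JIS_SQRT, round_trip(NEC_SQRT) == NEC_SQRT)
```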
Something that has been done before [1] is to map the doubles to the
private use area of the Unicode space (U+E000-U+F8FF). It gives you
round-trip support at the expense of having to handle those characters
yourself. But since Postgres doesn't do anything meaningful with
Unicode characters this might be acceptable.
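
A rough sketch of that idea, again in Python and again only an
illustration: the table `DOUBLES`, the chosen code point U+F000 and the
helpers `decode_rt`/`encode_rt` are all hypothetical, and a real table
would need an entry for every duplicated character. The duplicate byte
pair is sent to a private use code point on the way in and restored on
the way out, so both encodings of the square root survive.

```python
# Hypothetical table: duplicate cp932 byte pairs -> private use code points.
# U+F000 is an arbitrary pick, kept away from the start of the private use
# range because cp932 already uses part of it for user-defined characters.
DOUBLES = {
    b"\x87\x95": "\uf000",  # second encoding of the square-root sign
}
PUA_TO_BYTES = {char: raw for raw, char in DOUBLES.items()}


def is_lead_byte(byte: int) -> bool:
    """True if this byte starts a two-byte Shift-JIS character."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def decode_rt(raw: bytes) -> str:
    """Decode cp932, sending known duplicates to the private use area."""
    out = []
    i = 0
    while i < len(raw):
        width = 2 if is_lead_byte(raw[i]) else 1
        chunk = raw[i:i + width]
        out.append(DOUBLES.get(chunk) or chunk.decode("cp932"))
        i += width
    return "".join(out)


def encode_rt(text: str) -> bytes:
    """Inverse of decode_rt: private use code points become the duplicates."""
    return b"".join(PUA_TO_BYTES.get(ch) or ch.encode("cp932") for ch in text)


raw = b"\x81\xe3 \x87\x95 abc"
assert encode_rt(decode_rt(raw)) == raw
```

The price is visible in `decode_rt(b"\x87\x95")`: the result is not
U+221A, so anything that compares, sorts or displays the text has to know
about the private mapping.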
[1] Python does a similar trick to handle filenames coming from disk in
an unknown encoding:
http://docs.python.org/3/howto/unicode.html#files-in-an-unknown-encoding
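
For reference, the Python trick looks like this. Note that the
`surrogateescape` error handler maps undecodable bytes to lone surrogate
code points (U+DC80-U+DCFF) rather than to the private use area, but the
round-trip idea is the same. The filename is made up for the example.

```python
# A filename in Latin-1; the 0xE9 byte is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Undecodable bytes are smuggled through as lone surrogates ...
name = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# ... and come back out unchanged when encoding with the same handler.
restored = name.encode("utf-8", errors="surrogateescape")
print(restored == raw_name)
```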
Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer