Re: Proposal - Support for National Characters functionality

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: pg(at)heroku(dot)com, robertmhaas(at)gmail(dot)com, pavel(dot)stehule(at)gmail(dot)com, peter_e(at)gmx(dot)net, arul(at)fast(dot)au(dot)fujitsu(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal - Support for National Characters functionality
Date: 2013-07-16 21:07:27
Message-ID: 20130716210727.GD28628@svana.org
Lists: pgsql-hackers

On Mon, Jul 15, 2013 at 05:11:40PM +0900, Tatsuo Ishii wrote:
> > Does support for alternative multi-byte encodings have something to do
> > with the Han unification controversy? I don't know terribly much about
> > this, so apologies if that's just wrong.
>
> There's a famous problem regarding conversion between Unicode and other
> encodings, such as Shift JIS.
>
> There is a lot of discussion on this. Here is one from Microsoft:
>
> http://support.microsoft.com/kb/170559/EN-US

Apart from Shift-JIS not being well defined (it's more a family of
encodings), it has the unusual feature of providing multiple ways to
encode the same character. This is not even a Han unification issue;
those have largely been addressed. For example, the square-root symbol
exists twice (0x8795 and 0x81E3), as do many other mathematical
symbols.
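
To make the duplication concrete, here is a minimal sketch that assumes
Python's built-in "cp932" codec is a fair stand-in for the Microsoft code
page linked below. It decodes both byte sequences and encodes the result
again; since both decode to the same code point, at most one of them can
come back unchanged.

```python
# Two CP932 byte sequences from the message above. Assumption: Python's
# "cp932" codec follows the Microsoft table and decodes both of them to
# U+221A SQUARE ROOT.
candidates = [b"\x81\xe3", b"\x87\x95"]

decoded = [raw.decode("cp932") for raw in candidates]
reencoded = [ch.encode("cp932") for ch in decoded]

for raw, ch, back in zip(candidates, decoded, reencoded):
    status = "round-trips" if back == raw else "does NOT round-trip"
    print(f"{raw.hex()} -> U+{ord(ch):04X} -> {back.hex()}  ({status})")
```

Which of the two survives depends on the encoder's preference; the point
is only that one of them must be lost.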

Here's the code page which you can browse online:

http://msdn.microsoft.com/en-us/goglobal/cc305152

This means that, to be round-trippable, Unicode would have to duplicate
those characters, but that would make it hard or impossible to round-trip
with any other character set that has those characters. No easy solution
here.

Something that has been done before [1] is to map the duplicates to the
private use area of the Unicode space (U+E000-U+F8FF). It gives you
round-trip support at the expense of having to handle those characters
yourself. But since Postgres doesn't do anything meaningful with
Unicode characters this might be acceptable.
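
A rough sketch of that idea, again leaning on Python's "cp932" codec as
an assumed stand-in. The names PUA_BASE, build_table, decode_rt and
encode_rt are made up for illustration, and the choice of U+E000 as the
starting point is arbitrary: a real implementation would need a range
that nothing else in the code page is already mapped to.

```python
PUA_BASE = 0xE000  # hypothetical start of the range reserved for duplicates


def build_table(byte_pairs):
    """Assign a private-use code point to each pair that fails to round-trip."""
    table = {}
    for pair in byte_pairs:
        ch = pair.decode("cp932")
        if ch.encode("cp932") != pair:
            table[pair] = chr(PUA_BASE + len(table))
    return table


# Only the square-root pair from above; a real table would list every duplicate.
DUPLICATES = build_table([b"\x81\xe3", b"\x87\x95"])
REVERSE = {ch: pair for pair, ch in DUPLICATES.items()}


def is_lead_byte(b):
    # Shift-JIS lead-byte ranges for two-byte characters.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def decode_rt(data):
    """Decode CP932 bytes, sending known duplicates to private-use code points."""
    out, i = [], 0
    while i < len(data):
        n = 2 if is_lead_byte(data[i]) else 1
        chunk = data[i:i + n]
        out.append(DUPLICATES.get(chunk) or chunk.decode("cp932"))
        i += n
    return "".join(out)


def encode_rt(text):
    """Inverse of decode_rt: private-use code points become the original bytes."""
    return b"".join(REVERSE.get(ch) or ch.encode("cp932") for ch in text)


raw = b"x=\x81\xe3 y=\x87\x95"
text = decode_rt(raw)
print([f"U+{ord(c):04X}" for c in text])
print(encode_rt(text) == raw)
```

The cost is visible in the decoded string: one of the two square roots is
no longer U+221A, so anything that compares or searches text has to know
about the private mapping.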

[1] Python does a similar trick to handle filenames coming from disk in
an unknown encoding:
http://docs.python.org/3/howto/unicode.html#files-in-an-unknown-encoding
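
For comparison, a small sketch of the Python mechanism referred to in [1],
the "surrogateescape" error handler. Note that it parks undecodable bytes
on lone surrogate code points rather than in the private use area, so it
is similar in spirit, not identical. The byte string is just an example.

```python
# Bytes that are not valid UTF-8 (here, a CP932 byte pair after ASCII text).
raw = b"abc\x87\x95"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
print([f"U+{ord(c):04X}" for c in text])

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)
```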

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer
