Re: Errors in our encoding conversion tables

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors in our encoding conversion tables
Date: 2015-12-02 17:05:40
Message-ID: CA+Tgmoavwr3ZwwsjSqDU5reZywKnUBKV3uJ-+XXjLG9NNuX2wQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 27, 2015 at 8:54 PM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
> Let me explain why the manual editing is necessary.
>
> One of the most famous problems with Unicode is the "wave dash"
> (U+301C). According to the Unicode consortium's Unicode/SJIS map, it
> corresponds to 0x8160 in Shift_JIS. Unfortunately this was a mistake
> in Unicode (the Shift_JIS and Unicode glyphs are slightly different:
> it looks like the wave dash used in vertical writing, rotated by 90
> degrees. Probably they did not understand Japanese vertical
> writing at that time). So later on the Unicode consortium decided to
> add another "wave dash", U+FF5E, which has the correct "wave dash"
> glyph. However, since Unicode had already decided that U+301C
> corresponds to 0x8160 in Shift_JIS, there is no Shift_JIS code
> corresponding to U+FF5E. Unlike Unicode's definition, Microsoft
> defines 0x8160 (wave dash) as corresponding to U+FF5E. This is widely
> used in Japan, so I decided to adopt it for "wave dash", i.e.
>
> 0x8160 -> U+FF5E (sjis_to_utf8.map)
>
> U+301C -> 0x8160 (utf8_to_sjis.map)
> U+FF5E -> 0x8160 (utf8_to_sjis.map)
>
> Another problem is vendor extensions.
>
> There are several standards for SJIS and EUC_JP in Japan. There is a
> standard "Shift_JIS" defined by the Japanese government (the Unicode
> consortium's map is probably based on this, but I need to
> verify). However, several major vendors, including IBM and NEC, added
> their own additional characters to Shift_JIS, and these are widely
> used in Japan. Unfortunately they are not compatible. So as a
> compromise, other developers and I decided to "merge" the NEC and IBM
> extensions and add them to Shift_JIS. The same can be said of EUC_JP.
>
> In short, there are a number of reasons we cannot simply import the
> consortium's mappings for SJIS (and EUC_JP).

I haven't seen a response to this point, but it seems important.
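
To make the wave dash case concrete, here is a minimal sketch of the three map entries quoted above and the round-trip behavior they imply. The dictionaries and function names are hypothetical and hold only those three entries; this is not how the real map files are stored or searched.

```python
# Illustrative tables containing only the three entries quoted above.
# The names below are assumptions made for this sketch; the real map
# files hold many more entries and use a different representation.
SJIS_TO_UNICODE = {0x8160: 0xFF5E}
UNICODE_TO_SJIS = {0x301C: 0x8160, 0xFF5E: 0x8160}


def sjis_to_unicode(code: int) -> int:
    """Look up the Unicode code point for a Shift_JIS code."""
    return SJIS_TO_UNICODE[code]


def unicode_to_sjis(codepoint: int) -> int:
    """Look up the Shift_JIS code for a Unicode code point."""
    return UNICODE_TO_SJIS[codepoint]


def round_trips(codepoint: int) -> bool:
    """True if Unicode -> Shift_JIS -> Unicode gives back the input."""
    return sjis_to_unicode(unicode_to_sjis(codepoint)) == codepoint


if __name__ == "__main__":
    for cp in (0x301C, 0xFF5E):
        sjis = unicode_to_sjis(cp)
        back = sjis_to_unicode(sjis)
        print(f"U+{cp:04X} -> 0x{sjis:04X} -> U+{back:04X} "
              f"(round trip: {round_trips(cp)})")
```

With these entries both Unicode code points are accepted on the way to Shift_JIS, but only U+FF5E survives a round trip; U+301C comes back as U+FF5E. That asymmetry is why the two directions cannot be generated from a single one-to-one table.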

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
