Re: Illegal SJIS mapping

From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: hlinnaka(at)iki(dot)fi
Cc: horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Illegal SJIS mapping
Date: 2016-10-18 05:34:33
Message-ID: 20161018.143433.1192646816835803355.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> However, running the script with that doesn't produce exactly what we
> have in utf8_to_sjis.map, either. It's otherwise same, but we have
> some extra mappings:
>
> - {0xc2a5, 0x5c},

0xc2a5 is U+00a5. The glyph is "YEN SIGN" which is corresponding to
0x5c in SJIS. So this is a valid mapping.

In the mean time, Microsoft wants to map U+005c to 0x5c in CP932. The
glyph of U+005c is "REVERSE SOLDIUS" (back slash). So MS
decided that the glyph of U+00x5c is "YEN SIGN" in CP932!

In summary we need to keep both of mappings:

U+00a5 (utf 0xc2a5) -> 0x5c and U+005c -> 0x5c.

Obviously this breaks the round trip conversion between UTF8 and SJIS
encoding in this case though.

> - {0xc2ac, 0x81ca},
U+00ac (NOT SIGN). Exists in SJIS.

> - {0xe28096, 0x8161},

U+2016 (DOUBLE VERTICAL LINE). Exists in SJIS.

> - {0xe280be, 0x7e},

U+213e (OVERLINE). Mapped to acii 0x7e, which is "half width tilde".

> - {0xe28892, 0x817c},

U+2212 (MINUS SIGN). Mapped to "double width minus sign" in SJIS.

> - {0xe3809c, 0x8160},

u+301c (WAVE DASH). Mapped to "double width wave dash" in SJIS.

> Those mappings were added in commit
> a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, back in 2002. The bogus
> mapping for the invalid 0xc19c UTF-8 byte sequence was also added by
> that commit, as well a few valid mappings that UCS_to_SJIS.pl also
> produces.
>
> I can't judge if those mappings make sense. If we can't find an
> authoritative source for them, I suggest that we leave them as they
> are, but also hard-code them to UCS_to_SJIS.pl, so that running that
> script produces those mappings in utf8_to_sjis.map, even though they
> are not present in the CP932.TXT source file.

Sounds acceptable.

In summary current PostgreSQL UTF8 <--> SJIS mapping is a somewhat
mixture of SJIS (Shift_JIS) and MS932. There's no cleaner solution to
exodus this situation. I think we need live with it.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-10-18 07:35:27 Re: Password identifiers, protocol aging and SCRAM protocol
Previous Message Kyotaro HORIGUCHI 2016-10-18 04:10:42 Re: Illegal SJIS mapping