Re: Illegal SJIS mapping

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Illegal SJIS mapping
Date: 2016-10-07 20:58:45
Message-ID: 9c544547-7214-aebe-9b04-57624aedde96@iki.fi
Lists: pgsql-hackers

On 09/07/2016 09:50 AM, Kyotaro HORIGUCHI wrote:
> Hi,
>
> I found a useless entry in utf8_to_sjis.map
>
>> {0xc19c, 0x815f},
>
> which is apparently illegal as UTF-8, something PostgreSQL
> deliberately refuses. So it should be removed, and the attached
> patch does that. 0x815f (SJIS) is also mapped from 0xefbcbc (U+FF3C
> FULLWIDTH REVERSE SOLIDUS), and that is the right mapping.

Yes, I think you're right. Committed, thanks!
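
For reference, a minimal Python sketch (not part of the patch) of why
0xc19c is not valid UTF-8. The helper decode_two_byte_lenient is a
hypothetical name used only for illustration; it skips the shortest-form
check that a strict decoder applies. A strict decoder rejects the 0xC1
lead byte, while the lenient arithmetic shows the bytes would be an
overlong form of U+005C, the ASCII backslash.

```python
def decode_two_byte_lenient(b):
    """Decode a 2-byte UTF-8 pattern WITHOUT the shortest-form check.

    Illustration only: a conforming decoder must reject this input.
    """
    return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F)


bogus = bytes([0xC1, 0x9C])

# Python's strict UTF-8 decoder refuses the sequence outright.
try:
    bogus.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Without the check, the payload bits spell out U+005C.
lenient_cp = decode_two_byte_lenient(bogus)

# The legitimate key for SJIS 0x815f is the UTF-8 encoding of U+FF3C.
valid_key = "\uFF3C".encode("utf-8")

print(strict_ok, hex(lenient_cp), valid_key.hex())
```

Any code point below U+0080 has to be encoded as a single byte, so a
two-byte sequence starting with 0xC0 or 0xC1 can never be well-formed.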

> By the way, the file comment at the beginning of UCS_to_SJIS.pl
> is the following.
>
> # Generate UTF-8 <--> SJIS code conversion tables from
> # map files provided by Unicode organization.
> # Unfortunately it is prohibited by the organization
> # to distribute the map files. So if you try to use this script,
> # you have to obtain SHIFTJIS.TXT from
> # the organization's ftp site.
>
> The file was found at the following place thanks to google.
>
> ftp://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/
>
> As the URL shows, and as stated in the file
> Public/MAPPINGS/EASTASIA/ReadMe.txt, it is already obsolete and
> the *live* definition *may* be found in the Unicode Character
> Database. But I haven't found SJIS-related information there.
>
> If I'm not missing anything, the only available authority would
> be JIS X 0208/0213, but what should be implemented seems to be a
> maybe-modified MS932, for which I don't know the authority.
>
> Anyway, I ran UCS_to_SJIS.pl with the SHIFTJIS.TXT above and got
> quite different mapping files from the current ones.
>
> So, I wonder how the mappings related to SJIS (and/or EUC-JP) are
> maintained. If no authoritative information is available, the
> generating script is no longer usable. If some other authority is
> chosen, the script has to be modified according to whatever the new
> source format is.

The script is clearly intended to read CP932.TXT, rather than
SHIFTJIS.TXT, despite the comments in it. CP932.TXT can be found at

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

However, running the script with that doesn't produce exactly what we
have in utf8_to_sjis.map, either. The output is otherwise the same, but
the current map has some extra mappings:

- {0xc2a5, 0x5c},
- {0xc2ac, 0x81ca},
- {0xe28096, 0x8161},
- {0xe280be, 0x7e},
- {0xe28892, 0x817c},
- {0xe3809c, 0x8160},

Those mappings were added in commit
a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, back in 2002. The bogus
mapping for the invalid 0xc19c UTF-8 byte sequence was also added by
that commit, as well as a few valid mappings that UCS_to_SJIS.pl also
produces.
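
To make the six extra entries listed above easier to read, here is a
small Python sketch that turns each UTF-8 key back into a code point and
its Unicode character name. The helper names utf8_key_to_char and
describe are made up for this example. It only decodes the keys; it says
nothing about whether the SJIS side of each pair is the right choice.

```python
import unicodedata

# The extra UTF-8 -> SJIS pairs quoted in this message.
EXTRA = [
    (0xC2A5, 0x5C),
    (0xC2AC, 0x81CA),
    (0xE28096, 0x8161),
    (0xE280BE, 0x7E),
    (0xE28892, 0x817C),
    (0xE3809C, 0x8160),
]


def utf8_key_to_char(key):
    """Interpret an integer map key as big-endian UTF-8 bytes."""
    nbytes = (key.bit_length() + 7) // 8
    return key.to_bytes(nbytes, "big").decode("utf-8")


def describe(key, sjis):
    ch = utf8_key_to_char(key)
    return "0x%x = U+%04X %s -> SJIS 0x%x" % (
        key, ord(ch), unicodedata.name(ch), sjis)


for key, sjis in EXTRA:
    print(describe(key, sjis))
```

The keys decode to U+00A5 YEN SIGN, U+00AC NOT SIGN, U+2016 DOUBLE
VERTICAL LINE, U+203E OVERLINE, U+2212 MINUS SIGN and U+301C WAVE DASH.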

I can't judge whether those mappings make sense. If we can't find an
authoritative source for them, I suggest that we leave them as they are,
but also hard-code them in UCS_to_SJIS.pl, so that running that script
produces those mappings in utf8_to_sjis.map, even though they are not
present in the CP932.TXT source file.
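
A rough sketch of that idea, written in Python for illustration rather
than as a change to the actual Perl script. It assumes the source file
uses the usual layout of "SJIS code, Unicode code point, # name" in hex,
and the names SAMPLE, build_map and format_map are hypothetical. The
sample holds only a few lines in that layout, not the whole file.

```python
# A few lines in the CP932.TXT layout (assumed): SJIS, Unicode, # name.
SAMPLE = """\
# comment lines and blank lines are skipped
0x815F\t0xFF3C\t# FULLWIDTH REVERSE SOLIDUS
0x8160\t0xFF5E\t# FULLWIDTH TILDE
0x817C\t0xFF0D\t# FULLWIDTH HYPHEN-MINUS
"""

# Hard-coded UTF-8 -> SJIS pairs that the source file does not contain.
EXTRA = {
    0xC2A5: 0x5C,
    0xC2AC: 0x81CA,
    0xE28096: 0x8161,
    0xE280BE: 0x7E,
    0xE28892: 0x817C,
    0xE3809C: 0x8160,
}


def utf8_key(code_point):
    """Return the UTF-8 bytes of a code point packed into one integer."""
    return int.from_bytes(chr(code_point).encode("utf-8"), "big")


def build_map(text, extra):
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # comment, blank line, or undefined code
        sjis, ucs = int(fields[0], 16), int(fields[1], 16)
        table[utf8_key(ucs)] = sjis
    for key, sjis in extra.items():
        table.setdefault(key, sjis)  # entries from the file take precedence
    return table


def format_map(table):
    return ["{0x%x, 0x%x}," % (k, table[k]) for k in sorted(table)]


print("\n".join(format_map(build_map(SAMPLE, EXTRA))))
```

Using setdefault means a hard-coded pair never overrides an entry that
the source file does provide; that is a design choice of this sketch,
not something the existing script is known to do.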

- Heikki
