Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
Date: 2020-10-30 05:38:30
Message-ID: CA+HiwqEAcSaj6XC-DdzJtUdQi0Ds=+G202F3Y2Q-mmAaPkRviw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 30, 2020 at 12:20 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
> At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote in
> > However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> > encoding, the convert function returns the correct result. See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'sjis');
> > convert
> > ---------
> > \x817c
> > (1 row)
>
> It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason
> but maybe because it was used widely.
>
> So ping-pong between Unicode and SJIS behaves like this:
>
> U+2212 => 0x817c(at)sjis => U+ff0d => 0x817c(at)sjis ...

Is it the following piece of code in UCS_to_SJIS.pl that manually adds
the mapping?

# Add these UTF8->SJIS pairs to the table.
push @$mapping,
  ...
  {
    direction => FROM_UNICODE,
    ucs       => 0x2212,
    code      => 0x817c,
    comment   => '# MINUS SIGN',
    f         => $this_script,
    l         => __LINE__
  },

Given that U+2212 is encoded as e28892 in UTF-8, I assume that's how
utf8_to_sjis.map ends up with the following mapping to SJIS for that
byte sequence:

/*** Three byte table, leaf: e288xx - offset 0x004ee ***/

/* 80 */ 0x81cd, 0x0000, 0x81dd, 0x81ce, 0x0000, 0x0000, 0x0000, 0x81de,
/* 88 */ 0x81b8, 0x0000, 0x0000, 0x81b9, 0x0000, 0x0000, 0x0000, 0x0000,
/* 90 */ 0x0000, 0x8794, "0x817c", ...
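
To double-check the arithmetic, here is a minimal Python sketch (not
PostgreSQL code). It assumes the leaf can be read as a flat array
indexed by the third UTF-8 byte minus 0x80, which is a simplification
of how the generated radix tree is actually walked. LEAF_E288 is a
hypothetical name and holds only the values quoted above.

```python
# Values copied from the quoted e288xx leaf; entries past 0x92 are
# elided in the message, so they are left out here as well.
LEAF_E288 = [
    0x81CD, 0x0000, 0x81DD, 0x81CE, 0x0000, 0x0000, 0x0000, 0x81DE,  # 80
    0x81B8, 0x0000, 0x0000, 0x81B9, 0x0000, 0x0000, 0x0000, 0x0000,  # 88
    0x0000, 0x8794, 0x817C,                                          # 90
]

# U+2212 MINUS SIGN encoded as UTF-8.
utf8 = "\u2212".encode("utf-8")

# The first two bytes select the leaf (e288xx); the third byte selects
# the slot. Assumption: slots start at continuation byte 0x80.
leaf_prefix = utf8[:2]
slot = utf8[2] - 0x80
sjis_code = LEAF_E288[slot]

print(utf8.hex(), leaf_prefix.hex(), hex(slot), hex(sjis_code))
```

The third byte 0x92 lands on the third entry of the /* 90 */ row, which
is where 0x817c sits.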

> > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > MINUS SIGN in SJIS and that is what we expect. Isn't it?
>
> I think we don't change authoritative mappings, but maybe can add some
> one-way conversions for the convenience.

Maybe UCS_to_EUC_JP.pl could do something like the above.
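
For illustration, a rough Python model of what such a one-way entry
does, using the SJIS values from this thread. This is only a sketch:
the direction constants and build_tables() are stand-ins for the Perl
side, and I am assuming the U+FF0D <-> 0x817c pair is a regular
two-way entry, as the ping-pong described above suggests.

```python
# Direction flags, modeled as bit masks (hypothetical stand-ins).
TO_UNICODE, FROM_UNICODE, BOTH = 1, 2, 3

MAPPING = [
    # Assumed two-way entry from the authoritative mapping.
    {"direction": BOTH, "ucs": 0xFF0D, "code": 0x817C},
    # Manually added one-way entry, as in the Perl snippet above.
    {"direction": FROM_UNICODE, "ucs": 0x2212, "code": 0x817C},
]


def build_tables(mapping):
    """Split the entries into Unicode->legacy and legacy->Unicode maps."""
    to_legacy, to_unicode = {}, {}
    for entry in mapping:
        if entry["direction"] & FROM_UNICODE:
            to_legacy[entry["ucs"]] = entry["code"]
        if entry["direction"] & TO_UNICODE:
            to_unicode[entry["code"]] = entry["ucs"]
    return to_legacy, to_unicode


to_legacy, to_unicode = build_tables(MAPPING)

# Follow U+2212 through one round trip and back out again.
first = to_legacy[0x2212]      # Unicode -> SJIS
back = to_unicode[first]       # SJIS -> Unicode
second = to_legacy[back]       # Unicode -> SJIS again
print(hex(first), hex(back), hex(second))
```

The one-way entry only touches the Unicode-to-legacy table, so the
reverse direction keeps returning U+FF0D and the authoritative mapping
stays as it is.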

Are there other cases that were fixed like this in the past, either
for euc_jp or sjis?

--
Amit Langote
EDB: http://www.enterprisedb.com
