Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
Date: 2020-10-30 03:08:51
Message-ID: CA+HiwqGwTDgFBjzVo+TjQMCBWfs-NCZi_FXmgZPypwfMgiE9OQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 30, 2020 at 9:44 AM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> Hi All,
>
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
>
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
> convert
> ----------
> \xefbc8d
> (1 row)
>
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS?
>
> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert function fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR: character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"
>
> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
> convert
> ---------
> \x817c
> (1 row)
>
> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?

So we have

a1dd in euc_jp,
817c in sjis,
efbc8d in utf-8

that convert between each other just fine.

But when it comes to

e28892 in utf-8

it currently converts only to sjis, and even then only in one direction:

select convert('\xe28892', 'utf-8', 'sjis');
convert
---------
\x817c
(1 row)

select convert('\x817c', 'sjis', 'utf-8');
convert
----------
\xefbc8d
(1 row)
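
For anyone who wants to double-check which code points those two UTF-8
byte sequences stand for, here is a minimal Python sketch. It is not part
of PostgreSQL; the helper name describe() is made up for illustration, and
it relies only on Python's own UTF-8 decoder and Unicode name table:

```python
import unicodedata

# The two UTF-8 byte sequences discussed in this thread.
SEQUENCES = ["e28892", "efbc8d"]


def describe(hex_bytes: str) -> str:
    """Decode a hex-encoded UTF-8 sequence and name its single code point."""
    ch = bytes.fromhex(hex_bytes).decode("utf-8")
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for seq in SEQUENCES:
    print(seq, "->", describe(seq))
```

This should report e28892 as U+2212 MINUS SIGN and efbc8d as U+FF0D
FULLWIDTH HYPHEN-MINUS, which matches the names used in the subject line.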

I noticed that the commit a8bd7e1c6e02 from ages ago removed
conversions from and to utf-8's e28892, in favor of efbc8d, and that
change has stuck. (Note though that these maps looked pretty
different back then.)

--- a/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
+++ b/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
- {0xa1dd, 0xe28892},
+ {0xa1dd, 0xefbc8d},

--- a/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
- {0xe28892, 0xa1dd},
+ {0xefbc8d, 0xa1dd},

I can't tell what the reason for that was, but there must have been
one. Maybe the Japanese character sets prefer FULLWIDTH HYPHEN-MINUS
(Unicode U+FF0D) over the mathematical MINUS SIGN (U+2212)?
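
To make the asymmetry easier to see, here is a toy Python model of the
behavior described above. The dictionaries are hypothetical stand-ins that
contain only the entries observed in this thread; they are not the real
.map files, and convert() and round_trips() are illustrative helpers, not
PostgreSQL functions:

```python
# Toy stand-ins for the conversion maps, holding only the entries
# observed in this thread (hex strings, not the real .map contents).
UTF8_TO_SJIS = {"e28892": "817c", "efbc8d": "817c"}
SJIS_TO_UTF8 = {"817c": "efbc8d"}
UTF8_TO_EUC_JP = {"efbc8d": "a1dd"}
EUC_JP_TO_UTF8 = {"a1dd": "efbc8d"}


def convert(seq: str, table: dict) -> str:
    """Look up one byte sequence; fail like convert() does when unmapped."""
    if seq not in table:
        raise ValueError(f"byte sequence {seq} has no equivalent")
    return table[seq]


def round_trips(seq: str, there: dict, back: dict) -> bool:
    """Return True if seq survives conversion there and back unchanged."""
    try:
        return convert(convert(seq, there), back) == seq
    except ValueError:
        return False


print(round_trips("efbc8d", UTF8_TO_SJIS, SJIS_TO_UTF8))      # fullwidth, via sjis
print(round_trips("e28892", UTF8_TO_SJIS, SJIS_TO_UTF8))      # minus sign, via sjis
print(round_trips("efbc8d", UTF8_TO_EUC_JP, EUC_JP_TO_UTF8))  # fullwidth, via euc_jp
print(round_trips("e28892", UTF8_TO_EUC_JP, EUC_JP_TO_UTF8))  # minus sign, via euc_jp
```

In this model only efbc8d round-trips; e28892 either comes back as efbc8d
(via sjis) or cannot be converted at all (via euc_jp).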

--
Amit Langote
EDB: http://www.enterprisedb.com
