Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: amitlangote09(at)gmail(dot)com
Cc: ashu(dot)coek88(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
Date: 2020-10-30 07:56:38
Message-ID: 20201030.165638.1664587537743852598.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Fri, 30 Oct 2020 16:33:01 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> At Fri, 30 Oct 2020 14:38:30 +0900, Amit Langote <amitlangote09(at)gmail(dot)com> wrote in
> I'm not sure how we should construct our won mapping, but the
> difference made by we simply moved to JIS0208.TXT based as Ishii-san
> suggested the differences in the mapping would be as the follows.

Mmm..

I'm not sure how we should construct our own mapping, but if we
simply moved to a JIS0208.TXT-based mapping as Ishii-san suggested,
the following differences would be seen in the mappings.

> 1. The following codes (regions) are not defined in JIS0208.
>
> 8ea1 - 8edf (up to 64 characters (I didn't actually counted them.))
> ada1 - adfc (up to 92 characters (ditto))
> 8ff3f3 - 8ff4a8 (up to 182 characters (ditto))

8ea1 - 8edf (64 chars, U+ff61 - U+ff9f) (hankaku-kana)
ada1 - adfc (83 chars, U+2460 - U+33a1) (circled numbers)
8ff3f3 - 8ff4a8 (20 chars, U+2160 - U+2179) (roman numerals)

> a1c0 ff3c: (ff3c: FULLWIDTH REVERSE SOLIDUS)
> 8ff4aa ff07: (ff07: FULLWIDTH APOSTROPHE)
>
> 2. some individual differences
>
> EUC 0208 932
> a1c1 301c ff5e: (301c:WAVE DASH)
> a1c2 2016 2225: (2016:DOUBLE_VERTICAL LINE) : (2225:PARALLEL TO)
> * a1dd 2212 ff0d: (2212: MINUS_SIGN) : (ff0d: FULLWIDTH HYPHEN-MINUS)
> d1f1 a2 ffe0: (00a2: CENT SIGN) : (ffe0: FULLWIDTH CENT SIGN)
> d1f2 a3 ffe1: (00a3: POUND SIGN) : (ffe1: FULLWIDTH POUND SIGN)
> a2cc ac ffe2: (00ac: NOT SIGN) : (ffe2: FULLWIDTH NOT SIGN)
>
>
> *1: https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
>
> > > > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > > > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > > > MINUS SIGN in SJIS and that is what we expect. Isn't it?
> > >
> > > I think we don't change authoritative mappings, but maybe can add some
> > > one-way conversions for the convenience.
> >
> > Maybe UCS_TO_EUC_JP.pl could do something like the above.
> >
> > Are there other cases that were fixed like this in the past, either
> > for euc_jp or sjis?
>
> Honestly, I don't know how the mapping was decided in 2002, but
> removing the regions in 1 would cause confusion. So what we can do in
> this area would be changing some of 2 to the 0208 mapping. But an
> arbitrary mixture of different mappings would cause new problems.

I forgot about adding one-way mappings. I think we can add several
such mappings, for example:

U+301c->: EUC:a1c1 <-> U+ff5e
U+2016->: EUC:a1c2 <-> U+2225
U+2212->: EUC:a1dd <-> U+ff0d
U+00a2->: EUC:d1f1 <-> U+ffe0
U+00a3->: EUC:d1f2 <-> U+ffe1
U+00ac->: EUC:a2cc <-> U+ffe2
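
To illustrate what such one-way entries would mean, here is a minimal
sketch in Python (not the Perl of UCS_TO_EUC_JP.pl, and not the actual
PostgreSQL map format). All table and function names are hypothetical.
The byte codes are copied from the list above, and the first source
code point is taken to be U+301C (WAVE DASH), as in the comparison
table quoted earlier. The idea is that the round-trip pairs stay
untouched, and the extra entries are consulted only when converting
from Unicode to EUC-JP.

```python
# Hypothetical sketch: these names are illustrative, not PostgreSQL's.

# Existing round-trip pairs: EUC-JP code <-> Unicode code point.
ROUND_TRIP = {
    0xA1C1: 0xFF5E,
    0xA1C2: 0x2225,
    0xA1DD: 0xFF0D,
    0xD1F1: 0xFFE0,
    0xD1F2: 0xFFE1,
    0xA2CC: 0xFFE2,
}

# Proposed additions, used only in the Unicode -> EUC-JP direction.
ONE_WAY_TO_EUC = {
    0x301C: 0xA1C1,  # WAVE DASH
    0x2016: 0xA1C2,  # DOUBLE VERTICAL LINE
    0x2212: 0xA1DD,  # MINUS SIGN
    0x00A2: 0xD1F1,  # CENT SIGN
    0x00A3: 0xD1F2,  # POUND SIGN
    0x00AC: 0xA2CC,  # NOT SIGN
}

# EUC-JP -> Unicode sees only the round-trip pairs.
EUC_TO_UCS = dict(ROUND_TRIP)

# Unicode -> EUC-JP sees the inverted round-trip pairs plus the one-way entries.
UCS_TO_EUC = {ucs: euc for euc, ucs in ROUND_TRIP.items()}
UCS_TO_EUC.update(ONE_WAY_TO_EUC)


def ucs_to_euc(code_point: int) -> int:
    """Return the EUC-JP code for a Unicode code point (KeyError if unmapped)."""
    return UCS_TO_EUC[code_point]


def euc_to_ucs(euc_code: int) -> int:
    """Return the Unicode code point for an EUC-JP code (KeyError if unmapped)."""
    return EUC_TO_UCS[euc_code]
```

With tables like these, both U+2212 and U+ff0d would convert to a1dd,
while a1dd would still convert back to U+ff0d only, so the existing
round-trip behavior is unchanged.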

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
