Re: Patch for collation using ICU

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: john(at)geeknet(dot)com(dot)au
Cc: pgman(at)candle(dot)pha(dot)pa(dot)us, girgen(at)pingpong(dot)net, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch for collation using ICU
Date: 2005-05-10 07:44:48
Message-ID: 20050510.164448.71085314.t-ishii@sra.co.jp
Lists: pgsql-hackers

> Tatsuo Ishii wrote:
> > Sent: Tuesday, May 10, 2005 12:32 AM
> > To: John Hansen
> > Cc: pgman(at)candle(dot)pha(dot)pa(dot)us; girgen(at)pingpong(dot)net;
> > pgsql-hackers(at)postgresql(dot)org
> > Subject: Re: [HACKERS] Patch for collation using ICU
> >
> > > > -----Original Message-----
> > > > From: Tatsuo Ishii [mailto:t-ishii(at)sra(dot)co(dot)jp]
> > > > Sent: Sunday, May 08, 2005 11:08 PM
> > > > To: John Hansen
> > > > Cc: pgman(at)candle(dot)pha(dot)pa(dot)us; girgen(at)pingpong(dot)net;
> > > > pgsql-hackers(at)postgresql(dot)org
> > > > Subject: Re: [HACKERS] Patch for collation using ICU
> > > >
> > > > > > I don't buy it. If the current conversion tables do the right
> > > > > > thing, why do we need to replace them? Or if the conversion
> > > > > > tables are not correct, why don't you fix them? I think the
> > > > > > rules of character conversion will not change frequently,
> > > > > > especially for LATIN languages. Thus the maintenance cost is
> > > > > > not too high.
> > > > >
> > > > > I never said we need to, but if we're going to implement ICU,
> > > > > then we might as well go all the way.
> > > >
> > > > So you admit there's no benefit using ICU for replacing existing
> > > > conversions?
> > > >
> > > > Besides the fact that ICU does not support all existing
> > > > conversions, I think ICU has a serious flaw when used for
> > > > conversion. If I understand correctly, ICU uses UNICODE internally
> > > > to do the conversion. For example, to implement SJIS->EUC_JP
> > > > conversion, ICU first converts SJIS to UNICODE, then converts
> > > > UNICODE to EUC_JP. The problem is that these conversions are not
> > > > round trip (conversion between SJIS/EUC_JP and UNICODE will lose
> > > > some information). Thus SJIS->EUC_JP->SJIS conversion using ICU
> > > > does not preserve the original text.
> > >
> > > Just for the record, I fetched a web page encoded in sjis, and
> > > converted it to euc-jp and back using uconv from ICU 3.2, and the
> > > result is that the original is identical to the transformed file.
> > >
> > > uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
> > > uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc
> > > diff index.html index.html.sjis
> >
> > Not all SJIS/EUC_JP characters have the problem. You might want to
> > try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
> >
> > BTW, I got this with ICU 3.2:
> >
> > $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt
> > Conversion from Unicode to codepage failed at input byte
> > position 0. Unicode: 301c Error: Invalid character found
> >
> > The content of a.txt is 0xa1c1, which is a valid EUC_JP character.
>
> That actually makes perfect sense, since according to unicode.org's
> database:
> 301C ~ WAVE DASH
> This character was encoded to match JIS C 6226-1978 1-33 "wave
> dash".
> The JIS standards and some industry practise disagree in mapping.
> - 3030 wavy dash
> - FF5E full width tilde
>
> In PG FF5E is the mapping currently used. That is obviously wrong
> (according to the standards), as that is only a 'similar character'.
>
> Unfortunately, there is no mapping from 301C to shift_jis, as shift_jis
> doesn't define "WAVE DASH".
> In all, I believe this behaviour to be correct according to the
> standards.
>
> There'd be nothing to stop us from defining alternative mappings for the
> cases where we deviate from the standard, but the question is, should we
> be non-standard?

You missed the point. EUC_JP 0xa1c1 is perfectly valid data, and
uconv -f EUC_JP -t Shift_JIS should convert it to Shift_JIS 0x8160
regardless of uconv's internals.
--
Tatsuo Ishii
