Re: BUG #4714: Unicode Big5 Conversion

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: 張桂賢 Roger Chang <rchang111(at)gmail(dot)com>
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #4714: Unicode Big5 Conversion
Date: 2009-03-18 15:31:19
Message-ID: 49C113C7.5090208@enterprisedb.com
Lists: pgsql-bugs

Heikki Linnakangas wrote:
> 張桂賢 Roger Chang wrote:
>> There is an authoritative source for the Big5 encoding, but I don't
>> believe it helps:
>>
>> http://www.cns11643.gov.tw/AIDB/encodings_en.do
>>
>> Let's skip the historical mess and focus on reality.
>>
>> A brief timeline of events:
>>
>> * BIG5 created, mostly by ETEN company, some others but not
>> important now.
>> * CNS Standard like 11643, Taiwan Government authority building in
>> mean time ...
>> * Windows 3 showup, need Chinese ... pick not CNS but BIG5 ???
>> Code Page 950 born.
>> * ETEN company add "ETen-extension 0xF9D6-0xF9FE" to work with
>> IBM5550
>> * Since Windows ME, CP950 add above mentioned 7 char.
>> 0xF9D6-0xF9FE ONLY ???
>> * Later Hong Kong add above 7 Char. plus some more symbol in
>> HKSCS-2004, and what you found is right.
>> * WHAT A MESS !
>>
>> Focusing on reality,
>> only the 7 characters mentioned above are what I have needed to build
>> into the pgsql sources, for some years now, to be compliant with CP950.
>
> Ok, so Windows codepage 950 has those 7 characters, but not the other
> ETEN extended chars. I think that's a good enough reason to add those 7
> chars; we have 'win950' as an alias for big5 anyway.

I downloaded the latest CP950 - Unicode conversion table from
ftp://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT,
and ran it through the UCS_to_most.pl script in
src/backend/utils/mb/Unicode. Comparing the result to our current big5
conversion table, there are a couple of minor differences in the mapping
of punctuation characters, e.g. 0xa145 is mapped to Unicode character
U+2022 BULLET in big5, and to U+2027 HYPHENATION POINT in CP950. We're
also missing all the "box drawing" characters in CP950 in the ranges
0xc6a1-0xc6fe, 0xc470-0xc7fc and 0xf9dd-0xf9fe. And then there are the 7
characters you mentioned in the range 0xf9d6-0xf9dc.
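
For illustration only, here is a minimal Python sketch of this kind of
table comparison. It is not the UCS_to_most.pl script; the function names
parse_mapping and compare are made up for the example, and the sample rows
are a tiny hand-written stand-in for the real tables (the 0xa145 values are
the ones quoted above, the other rows are illustrative). It assumes the
unicode.org mapping layout of a hex code, a hex Unicode value, then a
"#" comment.

```python
def parse_mapping(text):
    """Parse lines like '0xA145<TAB>0x2027<TAB>#comment' into {code: unicode}."""
    table = {}
    for line in text.splitlines():
        # Drop the trailing comment, then split the remaining columns.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment-only line, or lead byte with no mapping
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def compare(ours, theirs):
    """Return (codes mapped differently, codes present only in 'theirs')."""
    changed = {
        code: (ours[code], theirs[code])
        for code in ours.keys() & theirs.keys()
        if ours[code] != theirs[code]
    }
    missing = sorted(theirs.keys() - ours.keys())
    return changed, missing


# Hypothetical sample data, NOT the full tables.
BIG5_SAMPLE = "0xA145\t0x2022\t# BULLET\n0xA440\t0x4E00\t# <CJK>\n"
CP950_SAMPLE = (
    "0x81\t\t#DBCS LEAD BYTE\n"
    "0xA145\t0x2027\t#HYPHENATION POINT\n"
    "0xA440\t0x4E00\t#CJK UNIFIED IDEOGRAPH\n"
    "0xF9D6\t0x7881\t#CJK UNIFIED IDEOGRAPH\n"
)

if __name__ == "__main__":
    changed, missing = compare(parse_mapping(BIG5_SAMPLE), parse_mapping(CP950_SAMPLE))
    for code, (old, new) in sorted(changed.items()):
        print(f"0x{code:04x}: big5 U+{old:04X}, cp950 U+{new:04X}")
    for code in missing:
        print(f"0x{code:04x}: only in cp950")
```

Run against the full tables, the "changed" set would correspond to the
punctuation differences and the "missing" list to the ranges named above.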

So although we use win950 as an alias for big5, it's not the same thing.
I guess we don't care about the box drawing characters, as they're not very
useful for a database, and we shouldn't change the mapping of existing
characters, for backwards-compatibility reasons. I wondered whether we
should make win950 a separate encoding, but they seem to be close enough
in practice that it's better to keep them the same.

So again, I'll just go add those 7 characters.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
