Re: BUG #1091: Localization in EUC_TW Can't decode Big5

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #1091: Localization in EUC_TW Can't decode Big5
Date: 2004-03-04 03:09:45
Message-ID: 20040304.120945.71083122.t-ishii@sra.co.jp
Lists: pgsql-bugs

The problem with Big5 is that there's no well-established standard for
it. Here is an excerpt from the famous cjk.txt by Ken Lunde:

----------------------------------------------------------------
2.3.1: BIG FIVE

The Big Five character set is composed of 94 rows of 157
characters each (the 157 characters of each row are encoded in an
initial group of 63 codes followed by the remaining 94 codes). The
following is a break-down of its contents:

o Row 1: 157 symbols
o Row 2: 157 symbols
o Row 3: 94 symbols
o Rows 4 through 38: 5,401 hanzi (Level 1 Hanzi; last is 38-63)
o Rows 41 through 89: 7,652 hanzi (Level 2 Hanzi; last is 89-116)

This forms what I consider to be the basic Big Five set. Actually, two
of the hanzi in Level 2 are duplicates, so there are actually only
7,650 unique hanzi in Level 2.

There are two major extensions to Big Five. The first really
has no name, and can be considered part of the basic Big Five set as
specified above. It adds the following characters:

o Rows 38-39: 4 Japanese iteration marks, 83 hiragana, 86 katakana, 66
uppercase and lowercase Cyrillic (Russian) alphabet, 10 circled
digits, and 10 parenthesized digits

The other extension was developed by a company called ETen
Information System in Taiwan, and is actually considered to be the
most widely used version of Big Five. It provides the following
extensions to Big Five (different from the above extension):

o Rows 38-40: 10 circled digits, 10 parenthesized digits, 10 lowercase
Roman numerals, 25 classical radicals, 15 Japanese-specific symbols,
83 hiragana, 86 katakana, 66 uppercase and lowercase Cyrillic
(Russian) alphabet, 3 arrows, 10 radical-like hanzi elements, 40
fraction-like digits, and 7 symbols
o Row 89: 7 hanzi, 33 double-lined line-drawing elements, and a black
box

It is *very* important to note that while these two extensions
have many common portions (in particular, hiragana, katakana, the
Cyrillic alphabet, and so on), they do not share the same code points
for such characters.
----------------------------------------------------------------
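
To make the layout in the excerpt concrete, here is a minimal Python
sketch that turns a (row, cell) position into a two-byte code. It assumes
that row 1 corresponds to lead byte 0xA1 and that the 63 + 94 cells of a
row map onto trail bytes 0x40-0x7E and 0xA1-0xFE, which matches the byte
ranges quoted in the bug report below. The function names are made up for
illustration and are not part of PostgreSQL.

```python
ROW_WIDTH = 157  # cells per row: 63 low trail bytes + 94 high trail bytes


def big5_bytes(row: int, cell: int) -> bytes:
    """Map a 1-based (row, cell) position to a two-byte Big Five code.

    Assumption: row 1 is lead byte 0xA1; cells 1..63 use trail bytes
    0x40..0x7E and cells 64..157 use trail bytes 0xA1..0xFE.
    """
    if not 1 <= row <= 94:
        raise ValueError("row must be in 1..94")
    if not 1 <= cell <= ROW_WIDTH:
        raise ValueError("cell must be in 1..157")
    lead = 0xA0 + row
    if cell <= 63:
        trail = 0x40 + (cell - 1)
    else:
        trail = 0xA1 + (cell - 64)
    return bytes([lead, trail])


def span(first: tuple, last: tuple) -> int:
    """Count positions from first to last (both inclusive, as (row, cell))."""
    (r1, c1), (r2, c2) = first, last
    return (r2 - r1) * ROW_WIDTH + (c2 - c1) + 1


print(big5_bytes(38, 63).hex())   # last Level 1 hanzi position
print(big5_bytes(89, 116).hex())  # last Level 2 hanzi position
print(span((4, 1), (38, 63)))     # Level 1 hanzi count
print(span((41, 1), (89, 116)))   # Level 2 hanzi count
```

Under these assumptions the two spans come out to 5,401 and 7,652, which
agrees with the counts in the excerpt. The reported code 0xFA40 would be
row 90, cell 1, so the range in the bug report covers rows 90 through 94,
which the excerpt does not list at all.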

If someone is sure that there is an established standard for it, including
mappings between Big5 and EUC-TW and between Big5 and Unicode, and is
*also* willing to provide patches, I will welcome them. Meanwhile, you
could write your own mapping between Big5 and other encodings. See the
CREATE CONVERSION command documentation for more details.
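
As a rough illustration of what "your own mapping" could look like, here
is a Python sketch of the table lookup only. It is not a PostgreSQL
conversion function (CREATE CONVERSION expects a server-side function);
it just shows the idea of consulting a private table first and falling
back to a standard Big5 table. The choice of Unicode Private Use Area
code points starting at U+E000 is an arbitrary assumption of this sketch,
and both function names are hypothetical.

```python
ROW_WIDTH = 157  # cells per Big Five row


def user_area_table(first_row: int = 90, last_row: int = 94,
                    base: int = 0xE000) -> dict:
    """Build a hypothetical {big5 bytes: character} table.

    Assumption: each position in rows first_row..last_row is assigned the
    next code point counting up from `base`, in row-major order.
    """
    table = {}
    for row in range(first_row, last_row + 1):
        lead = 0xA0 + row
        for cell in range(1, ROW_WIDTH + 1):
            trail = 0x40 + cell - 1 if cell <= 63 else 0xA1 + cell - 64
            offset = (row - first_row) * ROW_WIDTH + (cell - 1)
            table[bytes([lead, trail])] = chr(base + offset)
    return table


def decode_big5(data: bytes, overrides: dict) -> str:
    """Decode Big5 bytes, preferring `overrides` over Python's big5 codec."""
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:           # one-byte ASCII
            out.append(chr(data[i]))
            i += 1
            continue
        pair = data[i:i + 2]
        if pair in overrides:
            out.append(overrides[pair])
        else:
            # Raises UnicodeDecodeError when the codec has no mapping.
            out.append(pair.decode("big5"))
        i += 2
    return "".join(out)


table = user_area_table()
text = decode_big5(b"A\xa4\x40\xfa\x40", table)
print([hex(ord(ch)) for ch in text])
```

A real conversion would also need the reverse direction and an agreed
target for each code, which is exactly the part that lacks a standard.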
--
Tatsuo Ishii

From: "PostgreSQL Bugs List" <pgsql-bugs(at)postgresql(dot)org>
Subject: [BUGS] BUG #1091: Localization in EUC_TW Can't decode Big5 0xFA40--0xFEF0.
Date: Wed, 3 Mar 2004 22:08:47 -0400 (AST)
Message-ID: <20040304020847(dot)E10A2CF4D3A(at)www(dot)postgresql(dot)com>

>
> The following bug has been logged online:
>
> Bug reference: 1091
> Logged by: yychen
>
> Email address: yychen(at)mail(dot)clhs(dot)tyc(dot)edu(dot)tw
>
> PostgreSQL version: 7.4
>
> Operating system: MS-WIN2000(Run With TAIWAN Big5)
>
> Description: Localization in EUC_TW Can't decode Big5
> 0xFA40--0xFEF0.
>
> Details:
>
> In Localization:
> DataBase
> When I save a string (with Big5 0xFA40-0xFEF0) to a database (encoded as
> EUC_TW or UNICODE) and then read it back, PostgreSQL can't decode these
> characters.
> According to: ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf.
> 3.3.4: BIG FIVE
>
> Big Five is the encoding system used on machines that support
> MS-DOS or Windows, and also for Macintosh (such as the Chinese
> Language Kit or the fully-localized operating system).
>
> Two-byte Standard Characters Encoding Ranges
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
> first byte range 0xA1-0xFE
> second byte ranges 0x40-0x7E, 0xA1-0xFE
>
> One-byte Characters Encoding Range
> ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
> ASCII 0x21-0x7E
>
> The encoding used on Macintosh is quite similar to the above,
> but has a slightly shortened two-byte range (second byte range up to
> 0xFC only) plus additional one-byte code points, namely 0x80
> (backslash), 0xFD ("copyright" symbol: "c" in a circle), 0xFE
> ("trademark" symbol: "TM" as a superscript), and 0xFF ("ellipsis"
> symbol: three dots).
>
>
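
The byte ranges quoted in the report can be checked mechanically. The
following Python sketch tests whether a two-byte code falls inside those
ranges; the function name and the `trail_max` parameter are illustrative
assumptions (the parameter exists only to model the shortened Macintosh
range mentioned above). Passing this check says nothing about whether any
particular mapping table assigns a character to the code.

```python
def is_big5_two_byte(lead: int, trail: int, trail_max: int = 0xFE) -> bool:
    """Return True if (lead, trail) lies in the quoted two-byte ranges.

    Ranges taken from the report: first byte 0xA1-0xFE, second byte
    0x40-0x7E or 0xA1-0xFE. Use trail_max=0xFC for the Macintosh variant.
    """
    lead_ok = 0xA1 <= lead <= 0xFE
    trail_ok = 0x40 <= trail <= 0x7E or 0xA1 <= trail <= trail_max
    return lead_ok and trail_ok


def split_code(code: int) -> tuple:
    """Split a 16-bit code such as 0xFA40 into (lead, trail) bytes."""
    return code >> 8, code & 0xFF


# Both ends of the reported range are structurally valid two-byte codes.
print(is_big5_two_byte(*split_code(0xFA40)))
print(is_big5_two_byte(*split_code(0xFEF0)))

# Total number of two-byte positions the ranges allow.
print(sum(is_big5_two_byte(l, t) for l in range(256) for t in range(256)))
```

The total is 94 x 157 positions, so the reported codes are well-formed
Big5 byte pairs even though they lie outside the rows that Ken Lunde's
description assigns.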
