Re: Wrong charset mappings

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: tom(at)nooper(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Wrong charset mappings
Date: 2003-02-12 13:30:02
Message-ID: 20030212.223002.71090228.t-ishii@sra.co.jp
Lists: pgsql-hackers pgsql-jdbc

I think the problem you are seeing is due to the mapping table changes
between 7.2 and 7.3. It seems there are more changes than just
u301c. Moreover, according to a recent discussion on the Japanese local
mailing list, 7.3's JDBC driver now relies on the encoding conversion
performed by the backend, i.e. the driver issues "set client_encoding =
'UNICODE'". This problem is very complex and I need time to find a good
solution. I don't think simply backing out the changes to the mapping
table would solve the problem.

> Hi all,
>
> One Japanese character has been causing my head to swim lately. I've
> finally tracked down the problem to both Java 1.3 and Postgresql.
>
> The problem character is namely:
> utf-16: 0x301C
> utf-8: 0xE3809C
> SJIS: 0x8160
> EUC_JP: 0xA1C1
> Otherwise known as the WAVE DASH character.
>
> The confusion stems from a very similar character 0xFF5E (utf-16) or
> 0xEFBD9E (utf-8) the FULLWIDTH TILDE.
>
> Java has just lately (1.4.1) finally fixed their mappings so that 0x301C
> maps correctly to both the correct SJIS and EUC-JP character. Previously
> (at least in 1.3.1) they mapped SJIS to 0xFF5E and EUC to 0x301C,
> causing all sorts of trouble.
>
> Postgresql at least picked one of the two characters namely 0xFF5E, so
> conversions in and out of the database to/from sjis/euc seemed to be
> working. Problem is when you try to view utf-8 from the database or if
> you read the data into java (utf-16) and try converting to euc or sjis
> from there.
>
> Anyway, I think postgresql needs to be fixed for this character. In my
> opinion what needs to be done is to change the mappings...
>
> euc-jp    utf-8       euc-jp
> ======    ========    ======
> 0xA1C1 -> 0xE3809C -> 0xA1C1
>
> sjis      utf-8       sjis
> ======    ========    ======
> 0x8160 -> 0xE3809C -> 0x8160
>
> As to what to do with the current mapping of 0xEFBD9E (utf-8)? It
> probably should be removed. Maybe you could keep the mapping back to the
> sjis/euc characters to help backward compatibility though. I'm not sure
> what is the correct approach there.
>
> If anyone can tell me how to edit the mappings under:
> src/backend/utils/mb/Unicode/
>
> and rebuild postgres to use them, then I can test this out locally.

Just edit src/backend/utils/mb/Unicode/*.map and rebuild
PostgreSQL. You will probably want to modify utf8_to_euc_jp.map and
euc_jp_to_utf8.map.
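
As a quick sanity check outside PostgreSQL, the byte values quoted above
can be reproduced with Python's standard codecs. This is only a sketch:
it assumes Python's euc_jp and shift_jis codecs follow the JIS X 0208
mapping (WAVE DASH at U+301C), and the helper names hex_bytes and
round_trips are made up for illustration. It says nothing about what the
PostgreSQL .map files currently contain.

```python
# Code points discussed in this thread.
WAVE_DASH = "\u301c"        # U+301C WAVE DASH
FULLWIDTH_TILDE = "\uff5e"  # U+FF5E FULLWIDTH TILDE


def hex_bytes(text, encoding):
    """Return the bytes of `text` in `encoding` as an uppercase hex string."""
    return text.encode(encoding).hex().upper()


def round_trips(raw, encoding):
    """Convert `raw` from `encoding` to UTF-8 and back again.

    Returns a tuple (utf8_bytes, bytes_back_in_original_encoding).
    """
    utf8_bytes = raw.decode(encoding).encode("utf-8")
    back = utf8_bytes.decode("utf-8").encode(encoding)
    return utf8_bytes, back


if __name__ == "__main__":
    # UTF-8 forms of the two easily confused characters.
    print("WAVE DASH       utf-8:", hex_bytes(WAVE_DASH, "utf-8"))
    print("FULLWIDTH TILDE utf-8:", hex_bytes(FULLWIDTH_TILDE, "utf-8"))

    # The round trips proposed in the quoted message.
    for raw, encoding in ((b"\xa1\xc1", "euc_jp"), (b"\x81\x60", "shift_jis")):
        utf8_bytes, back = round_trips(raw, encoding)
        print(encoding, raw.hex().upper(), "->",
              utf8_bytes.hex().upper(), "->", back.hex().upper())
```

Running the same byte values through a patched backend and comparing
against this output is one way to confirm that a mapping edit had the
intended effect.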
--
Tatsuo Ishii
