Re: [HACKERS] Wrong charset mappings

From: Barry Lind <blind(at)xythos(dot)com>
To: Thomas O'Dowd <tom(at)nooper(dot)com>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org, pgsql-jdbc(at)postgresql(dot)org
Subject: Re: [HACKERS] Wrong charset mappings
Date: 2003-02-12 17:54:12
Message-ID: 3E4A8A44.7040600@xythos.com
Lists: pgsql-hackers pgsql-jdbc

I don't see any JDBC-specific requirements here, other than the fact
that the JDBC driver assumes the following conversions are done correctly:

dbcharset <-> utf8 <-> java/utf16

where the dbcharset to/from utf8 conversion is done by the backend and
the utf8 to/from java/utf16 is done in the jdbc driver.
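
As a rough illustration of that split, here is a minimal Python sketch.
It is not the backend or driver code: Python's built-in codecs stand in
for the backend's mapping tables and for the driver's decoder, the
function names are hypothetical, and EUC_JP is only an assumed example
of a database charset.

```python
# Sketch only: Python codecs stand in for the backend's conversion tables
# and for the driver's UTF-8 decoder. All function names are hypothetical.

def backend_to_wire(db_bytes: bytes, db_charset: str) -> bytes:
    """Backend half: database charset -> UTF-8 bytes sent to the client."""
    return db_bytes.decode(db_charset).encode("utf-8")


def driver_to_string(wire_bytes: bytes) -> str:
    """Driver half: UTF-8 bytes -> in-memory string (UTF-16 in Java)."""
    return wire_bytes.decode("utf-8")


def driver_to_wire(text: str) -> bytes:
    """Driver half, reverse direction: in-memory string -> UTF-8 bytes."""
    return text.encode("utf-8")


def backend_to_db(wire_bytes: bytes, db_charset: str) -> bytes:
    """Backend half, reverse direction: UTF-8 bytes -> database charset."""
    return wire_bytes.decode("utf-8").encode(db_charset)


# Bytes as they might be stored in an EUC_JP database (assumed example).
stored = "日本語".encode("euc_jp")

# Read path: backend converts to UTF-8, driver decodes to a string.
text = driver_to_string(backend_to_wire(stored, "euc_jp"))

# Write path: the same two stages in reverse must give back the same bytes.
assert backend_to_db(driver_to_wire(text), "euc_jp") == stored
```

The point of the sketch is that the driver only ever sees UTF-8, so any
wrong entry in the dbcharset <-> utf8 tables passes straight through it.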

Prior to 7.3 the jdbc driver did the entire conversion itself. However,
versions of the jdk prior to 1.4 perform that conversion very slowly, so
for a significant speedup in 7.3 we moved most of the work to the
backend.

thanks,
--Barry

Thomas O'Dowd wrote:
> Hi Ishii-san,
>
> Thanks for the reply. Why was the particular change made between 7.2 and
> 7.3? It seems to have moved away from the standard. I found the
> following file...
>
> src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl
>
> Which generates the mappings. I found it references 3 files from the
> Unicode organisation, namely:
>
> http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
> http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
> http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT
>
> The JIS0208.TXT has the line...
>
> 0x8160 0x2141 0x301C # WAVE DASH
>
> 1st col is sjis, 2nd is the EUC code minus 0x8080, 3rd is utf16.
>
> Incidentally, those mapping files are marked obsolete but I guess the old
> mappings still hold.
>
> I guess if I run the perl script it will generate a mapping file
> different to what postgresql is currently using. It might be interesting
> to pull out the diffs and see what's right/wrong. I guess it's not run
> anymore?
>
> I can't see how the change will affect the JDBC driver. It should only
> improve the situation. Right now it's not possible to go from sjis ->
> database (utf8) -> java (jdbc/utf16) -> sjis for the WAVE DASH character
> because the mapping is wrong in postgresql. I'll cc the JDBC list and
> maybe we'll find out if it's a real problem to change the mapping.
>
> Changing the mapping I think is the correct thing to do from what I can
> see all around me in different tools like iconv, java 1.4.1, utf-8
> terminal and any unicode reference on the web.
>
> What do you think?
>
> Tom.
>
> On Wed, 2003-02-12 at 22:30, Tatsuo Ishii wrote:
>
>>I think the problem you see is due to the mapping table changes
>>between 7.2 and 7.3. It seems there are more changes other than
>>u301c. Moreover according to the recent discussion in Japanese local
>>mailing list, 7.3's JDBC driver now relies on the encoding conversion
>>performed by the backend. ie. The driver issues "set client_encoding =
>>'UNICODE'". This problem is very complex and I need time to find a good
>>solution. I don't think simply backing out the changes to the mapping
>>table solves the problem.
>>
>>
>>>Hi all,
>>>
>>>One Japanese character has been causing my head to swim lately. I've
>>>finally tracked down the problem to both Java 1.3 and Postgresql.
>>>
>>>The problem character is namely:
>>>utf-16: 0x301C
>>>utf-8: 0xE3809C
>>>SJIS: 0x8160
>>>EUC_JP: 0xA1C1
>>>Otherwise known as the WAVE DASH character.
>>>
>>>The confusion stems from a very similar character, 0xFF5E (utf-16) or
>>>0xEFBD9E (utf-8), the FULLWIDTH TILDE.
>>>
>>>Java has just lately (1.4.1) finally fixed their mappings so that 0x301C
>>>maps correctly to both the correct SJIS and EUC-JP character. Previously
>>>(at least in 1.3.1) they mapped SJIS to 0xFF5E and EUC to 0x301C,
>>>causing all sorts of trouble.
>>>
>>>Postgresql at least picked one of the two characters namely 0xFF5E, so
>>>conversions in and out of the database to/from sjis/euc seemed to be
>>>working. Problem is when you try to view utf-8 from the database or if
>>>you read the data into java (utf-16) and try converting to euc or sjis
>>>from there.
>>>
>>>Anyway, I think postgresql needs to be fixed for this character. In my
>>>opinion what needs to be done is to change the mappings...
>>>
>>>euc-jp -> utf-8    -> euc-jp
>>>======    ========    ======
>>>0xA1C1 -> 0xE3809C -> 0xA1C1
>>>
>>>sjis   -> utf-8    -> sjis
>>>======    ========    ======
>>>0x8160 -> 0xE3809C -> 0x8160
>>>
>>>As to what to do with the current mapping of 0xEFBD9E (utf-8)? It
>>>probably should be removed. Maybe you could keep the mapping back to the
>>>sjis/euc characters to help backward compatibility though. I'm not sure
>>>what is the correct approach there.
>>>
>>>If anyone can tell me how to edit the mappings under:
>>> src/backend/utils/mb/Unicode/
>>>
>>>and rebuild postgres to use them, then I can test this out locally.
>>
>>Just edit src/backend/utils/mb/Unicode/*.map and rebuild
>>PostgreSQL. You will probably want to modify utf8_to_euc_jp.map and
>>euc_jp_to_utf8.map.
>>--
>>Tatsuo Ishii
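
The byte values quoted in the thread can be cross-checked from the
single JIS0208.TXT row cited above. The following is a small Python
sketch, not part of the PostgreSQL sources; the helper names are
hypothetical. It uses the EUC = JIS + 0x8080 relation from the thread
and the usual JIS-to-Shift_JIS byte arithmetic, and reads the third
column as a Unicode code point.

```python
# Sketch only: helper names are hypothetical, not PostgreSQL functions.

def parse_jis0208_row(line: str) -> tuple:
    """Split a JIS0208.TXT data row into (sjis, jis, ucs) integers."""
    fields = line.split("#", 1)[0].split()
    sjis, jis, ucs = (int(f, 16) for f in fields[:3])
    return sjis, jis, ucs


def jis_to_euc(jis: int) -> int:
    """EUC-JP sets the high bit of both JIS X 0208 bytes."""
    return jis + 0x8080


def jis_to_sjis(jis: int) -> int:
    """Standard JIS X 0208 -> Shift_JIS byte arithmetic."""
    j1, j2 = jis >> 8, jis & 0xFF
    s1 = ((j1 + 1) >> 1) + (0x70 if j1 < 0x5F else 0xB0)
    if j1 & 1:
        s2 = j2 + 0x1F + (1 if j2 >= 0x60 else 0)
    else:
        s2 = j2 + 0x7E
    return (s1 << 8) | s2


def utf8_hex(ucs: int) -> str:
    """UTF-8 encoding of a code point, as an upper-case hex string."""
    return chr(ucs).encode("utf-8").hex().upper()


sjis, jis, ucs = parse_jis0208_row("0x8160 0x2141 0x301C # WAVE DASH")

assert jis_to_euc(jis) == 0xA1C1      # EUC_JP value quoted in the thread
assert jis_to_sjis(jis) == sjis       # matches the row's first column
assert utf8_hex(ucs) == "E3809C"      # WAVE DASH in utf-8
assert utf8_hex(0xFF5E) == "EFBD9E"   # FULLWIDTH TILDE in utf-8
```

All four values agree with the figures given for WAVE DASH and
FULLWIDTH TILDE earlier in the thread.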
