Re: [HACKERS] Wrong charset mappings

From: Thomas O'Dowd <tom(at)nooper(dot)com>
To: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-jdbc(at)postgresql(dot)org
Subject: Re: [HACKERS] Wrong charset mappings
Date: 2003-02-12 15:13:51
Message-ID: 1045062831.13002.5.camel@beast.uwillsee.com
Lists: pgsql-hackers pgsql-jdbc

Hi Ishii-san,

Thanks for the reply. Why was the particular change made between 7.2 and
7.3? It seems to have moved away from the standard. I found the
following file...

src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl

This script generates the mappings. It references three files from
unicode.org, namely:

http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT

JIS0208.TXT has the line...

0x8160 0x2141 0x301C # WAVE DASH

The 1st column is SJIS, the 2nd is the JIS X 0208 code (the EUC-JP code
minus 0x8080), and the 3rd is UTF-16.
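
As a quick check of that column arithmetic, here is a minimal Python
sketch. It only parses the one line quoted above; the helper name
parse_row is made up for illustration and is not part of the PostgreSQL
tree.

```python
# Illustrative only: derive the EUC-JP and UTF-8 bytes from one JIS0208.TXT row.
line = "0x8160 0x2141 0x301C # WAVE DASH"


def parse_row(row):
    """Split a mapping row into (sjis, jis, ucs) integers, ignoring the comment."""
    fields = row.split("#", 1)[0].split()
    sjis, jis, ucs = (int(field, 16) for field in fields)
    return sjis, jis, ucs


sjis, jis, ucs = parse_row(line)
euc = jis + 0x8080                 # sets the high bit on both bytes
utf8 = chr(ucs).encode("utf-8")    # UTF-8 bytes for the Unicode code point

print(hex(sjis), hex(euc), utf8.hex())  # 0x8160 0xa1c1 e3809c
```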

Incidentally, those mapping files are marked obsolete, but I guess the old
mappings still hold.

I guess if I run the perl script it will generate a mapping file
different from what postgresql is currently using. It might be interesting
to pull out the diffs and see what's right/wrong. I guess it's not run
anymore?
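
Pulling out those diffs could look something like the Python sketch
below. It assumes the .map files consist of C-style {0x..., 0x...}
entries (source code first, target second); I have not checked that
against the tree, and the two sample strings are made up to mirror the
WAVE DASH case rather than copied from real files.

```python
import re

# Assumed entry shape: {0xSOURCE, 0xTARGET}
PAIR = re.compile(r"\{\s*0x([0-9a-fA-F]+)\s*,\s*0x([0-9a-fA-F]+)\s*\}")


def load_pairs(text):
    """Return {source_code: target_code} for every {0x..., 0x...} entry."""
    return {int(src, 16): int(dst, 16) for src, dst in PAIR.findall(text)}


def diff_maps(old, new):
    """Return (code, old_target, new_target) tuples where the maps disagree."""
    return [
        (code, old.get(code), new.get(code))
        for code in sorted(old.keys() | new.keys())
        if old.get(code) != new.get(code)
    ]


def fmt(value):
    """Format a code as hex, or '-' when the entry is missing from one map."""
    return "-" if value is None else f"{value:#x}"


# Made-up sample entries, not real file contents.
current = "{0xa1c1, 0xefbd9e},\n{0xa1c2, 0xe28096},"
regenerated = "{0xa1c1, 0xe3809c},\n{0xa1c2, 0xe28096},"

changes = diff_maps(load_pairs(current), load_pairs(regenerated))
for code, old_target, new_target in changes:
    print(f"{fmt(code)}: {fmt(old_target)} -> {fmt(new_target)}")
```

To use it on real files, read the current map and the freshly generated
one into the two strings instead of the samples.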

I can't see how the change will affect the JDBC driver. It should only
improve the situation. Right now it's not possible to go from sjis ->
database (utf8) -> java (jdbc/utf16) -> sjis for the WAVE DASH character
because the mapping is wrong in postgresql. I'll cc the JDBC list and
maybe we'll find out if it's a real problem to change the mapping.
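
To make the round trip concrete, here is a rough Python sketch of the
same chain. It uses Python's own shift_jis codec as a stand-in for both
the backend conversion and Java, so it only illustrates the intended
behaviour and says nothing about what PostgreSQL or the JDBC driver
actually do.

```python
# Illustrative round trip: sjis -> utf-8 (database) -> utf-16 (Java side) -> sjis.
sjis_in = b"\x81\x60"                    # WAVE DASH in SJIS

text = sjis_in.decode("shift_jis")       # expected to be U+301C
stored = text.encode("utf-8")            # what a UTF-8 database would hold
utf16 = stored.decode("utf-8").encode("utf-16-be")  # in-memory form on the Java side
sjis_out = utf16.decode("utf-16-be").encode("shift_jis")

print(stored.hex(), sjis_out == sjis_in)

# If the middle step hands back U+FF5E (FULLWIDTH TILDE) instead, the
# final conversion depends on whether the SJIS table knows that character.
try:
    "\uff5e".encode("shift_jis")
    tilde_ok = True
except UnicodeEncodeError:
    tilde_ok = False
print("U+FF5E encodable as shift_jis:", tilde_ok)
```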

I think changing the mapping is the correct thing to do, judging from what
I can see all around me in different tools like iconv, java 1.4.1, a utf-8
terminal and any unicode reference on the web.

What do you think?

Tom.

On Wed, 2003-02-12 at 22:30, Tatsuo Ishii wrote:
> I think the problem you see is due to the mapping table changes
> between 7.2 and 7.3. It seems there are more changes other than
> u301c. Moreover, according to the recent discussion in the Japanese local
> mailing list, 7.3's JDBC driver now relies on the encoding conversion
> performed by the backend, i.e. the driver issues "set client_encoding =
> 'UNICODE'". This problem is very complex and I need time to find a good
> solution. I don't think simply backing out the changes to the mapping
> table solves the problem.
>
> > Hi all,
> >
> > One Japanese character has been causing my head to swim lately. I've
> > finally tracked down the problem to both Java 1.3 and Postgresql.
> >
> > The problem character is namely:
> > utf-16: 0x301C
> > utf-8: 0xE3809C
> > SJIS: 0x8160
> > EUC_JP: 0xA1C1
> > Otherwise known as the WAVE DASH character.
> >
> > The confusion stems from a very similar character 0xFF5E (utf-16) or
> > 0xEFBD9E (utf-8) the FULLWIDTH TILDE.
> >
> > Java has just lately (1.4.1) finally fixed their mappings so that 0x301C
> > maps correctly to both the correct SJIS and EUC-JP character. Previously
> > (at least in 1.3.1) they mapped SJIS to 0xFF5E and EUC to 0x301C,
> > causing all sorts of trouble.
> >
> > Postgresql at least picked one of the two characters namely 0xFF5E, so
> > conversions in and out of the database to/from sjis/euc seemed to be
> > working. Problem is when you try to view utf-8 from the database or if
> > you read the data into java (utf-16) and try converting to euc or sjis
> > from there.
> >
> > Anyway, I think postgresql needs to be fixed for this character. In my
> > opinion what needs to be done is to change the mappings...
> >
> > euc-jp -> utf-8 -> euc-jp
> > ====== ======== ======
> > 0xA1C1 -> 0xE3809C 0xA1C1
> >
> > sjis -> utf-8 -> sjis
> > ====== ======== ======
> > 0x8160 -> 0xE3809C 0x8160
> >
> > As to what to do with the current mapping of 0xEFBD9E (utf-8)? It
> > probably should be removed. Maybe you could keep the mapping back to the
> > sjis/euc characters to help backward compatibility though. I'm not sure
> > what is the correct approach there.
> >
> > If anyone can tell me how to edit the mappings under:
> > src/backend/utils/mb/Unicode/
> >
> > and rebuild postgres to use them, then I can test this out locally.
>
> Just edit src/backend/utils/mb/Unicode/*.map and rebuild
> PostgreSQL. You will probably want to modify utf8_to_euc_jp.map and
> euc_jp_to_utf8.map.
> --
> Tatsuo Ishii
--
Thomas O'Dowd <tom(at)nooper(dot)com>
Nooper.com Mobile Services Inc
