bugs with certain Asian multibyte charsets

From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: bugs with certain Asian multibyte charsets
Date: 2005-12-24 09:25:33
Message-ID: 20051224.182533.121216450.t-ishii@sraoss.co.jp
Lists: pgsql-hackers

I have found a long-standing bug in the handling of certain Asian
multibyte charsets (the original report was made by Mr. Ishida).

Some text operations on certain Asian charsets, such as EUC_JP code
set 3 (JIS X 0212), produce wrong results. As far as I know, these
include:

- strpos
- regular expression

It seems LIKE is not affected by this bug.

The bug has been there since 6.4. The reason we did not notice it is
that the affected charsets are rarely used. Other charsets affected by
the bug are EUC_CN code sets 2 and 3 (it seems they are not used at
all) and EUC_TW code sets 2 and 3 (it seems code set 3 is not used).
As far as I know, EUC_KR is not affected (I believe code sets 2 and 3
are not used in EUC_KR).

Here is a description of the bug.

- strpos

In EUC_JP database:

SELECT strpos(hextostr('8faaa18faae1'), hextostr('8faae1'));

returns 1 instead of 2, where hextostr() is a hexadecimal-to-string
conversion function developed by Mr. Ishida. Each three-byte sequence
starting with 8f is a JIS X 0212 letter encoded in EUC_JP (for
example, 8faaa18faae1 consists of 2 EUC_JP letters).
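
To illustrate why 8faaa18faae1 is two letters and not six, here is a
minimal sketch in C of how an EUC_JP byte string splits into
characters by lead byte. The function names eucjp_char_len() and
eucjp_count_chars() are made up for this illustration and are not
taken from the PostgreSQL source; the sketch also assumes well-formed
input.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of the EUC_JP character that starts with lead byte b.
 * Hypothetical helper for illustration only. */
static int eucjp_char_len(unsigned char b)
{
    if (b == 0x8f)
        return 3;               /* SS3: code set 3 (JIS X 0212) */
    if (b == 0x8e)
        return 2;               /* SS2: code set 2 (half-width katakana) */
    if (b & 0x80)
        return 2;               /* code set 1 (JIS X 0208) */
    return 1;                   /* code set 0 (ASCII) */
}

/* Count the characters in a well-formed EUC_JP string of n bytes. */
static size_t eucjp_count_chars(const unsigned char *s, size_t n)
{
    size_t i = 0;
    size_t count = 0;

    while (i < n)
    {
        size_t len = (size_t) eucjp_char_len(s[i]);

        assert(i + len <= n);   /* input must not end mid-character */
        i += len;
        count++;
    }
    return count;
}
```

Calling eucjp_count_chars() on the six bytes 8f aa a1 8f aa e1 gives 2,
so a character-based strpos() should report the second letter at
position 2.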

- regexp

SELECT hextostr('8faaa18faaa1') ~ hextostr('8faae1');

returns true instead of false.

Details of the bug:

In backend/utils/mb/wchar.c there are functions to convert multibyte
to wchar. When the conversion is performed, the second or third byte
is masked with 0x3f, which makes, for example, 8faaa1 and 8faae1 look
the same.
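
The collision can be reproduced with a small standalone C sketch. This
is not the code from wchar.c: the two functions below are hypothetical
and only model the masking described above, next to a variant that
keeps every bit. The lossless variant is an assumption for comparison,
not necessarily the fix that gets committed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lossy conversion of a 3-byte SS3 sequence: both
 * trailing bytes are masked with 0x3f, which drops bits 6 and 7. */
static uint32_t ss3_to_wchar_lossy(const unsigned char *p)
{
    assert(p[0] == 0x8f);       /* must start with SS3 */
    return ((uint32_t) (p[1] & 0x3f) << 8) | (uint32_t) (p[2] & 0x3f);
}

/* Hypothetical lossless conversion: all three bytes are kept. */
static uint32_t ss3_to_wchar_lossless(const unsigned char *p)
{
    assert(p[0] == 0x8f);       /* must start with SS3 */
    return ((uint32_t) p[0] << 16) | ((uint32_t) p[1] << 8) | (uint32_t) p[2];
}
```

With these definitions, 8faaa1 and 8faae1 both map to 0x2a21 under the
lossy conversion, because 0xa1 & 0x3f and 0xe1 & 0x3f are both 0x21.
The lossless conversion gives 0x8faaa1 and 0x8faae1, which differ.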

I'm going to commit fixes for 7.3-stable, 7.4-stable, 8.0-stable,
8.1-stable and current.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
