Re: Bug #943: Server-Encoding from EUC_TW to UTF-8 doesn't

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: michael(dot)enke(at)wincor-nixdorf(dot)com, pgsql-bugs(at)postgresql(dot)org
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug #943: Server-Encoding from EUC_TW to UTF-8 doesn't
Date: 2003-04-12 01:51:45
Message-ID: 20030412.105145.74752700.t-ishii@sra.co.jp
Lists: pgsql-bugs pgsql-hackers

It turned out to be a bug in PostgreSQL's encoding conversion engine.
It failed to find the proper entry in an encoding conversion table
because of an integer overflow problem in the comparison functions.
Since only the maps for EUC_TW have such huge code point values (for
example 0x8eaee7aa), I believe the conversion failure occurs only with
this particular encoding. The included patch should solve the problem
(it is against PostgreSQL 7.3.2).
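
To make the overflow concrete, here is a minimal sketch in C. It is not the code from conv.c: `map_entry`, `demo_map`, `lookup` and both comparator names are hypothetical, and it assumes a 32-bit `unsigned int`. The two small entries pair the EUC_TW and UTF-8 bytes from the hex dumps in the report quoted below; the UTF-8 value for 0x8eaee7aa is a placeholder. The point is only that a difference of two unsigned values can exceed what a signed `int` return value can express as a negative number, so the subtraction-style comparator reports the wrong ordering against a huge entry.

```c
#include <stdlib.h>

/* Hypothetical stand-in for one conversion-table entry. */
typedef struct
{
    unsigned int code;          /* local (EUC_TW) code, the sort key */
    unsigned int utf;           /* corresponding UTF-8 bytes */
} map_entry;

/* Pre-patch style: return the unsigned difference as an int. */
static int
compare_by_subtraction(const void *p1, const void *p2)
{
    unsigned int v1 = *(const unsigned int *) p1;
    unsigned int v2 = ((const map_entry *) p2)->code;

    /* 0xe5b5 - 0x8eaee7aa wraps to 0x7151fe0b, which is positive. */
    return (int) (v1 - v2);
}

/* Patched style: compare explicitly, never subtract. */
static int
compare_three_way(const void *p1, const void *p2)
{
    unsigned int v1 = *(const unsigned int *) p1;
    unsigned int v2 = ((const map_entry *) p2)->code;

    return (v1 > v2) ? 1 : ((v1 == v2) ? 0 : -1);
}

/* Illustrative table, sorted ascending by code. */
static const map_entry demo_map[] = {
    {0x0000c5ca, 0xe697a5},
    {0x0000e5b5, 0xe795b6},
    {0x8eaee7aa, 0x000000},     /* placeholder UTF-8 value */
};

/* Binary search using the patched comparator; NULL if not found. */
static const map_entry *
lookup(unsigned int code)
{
    return bsearch(&code, demo_map,
                   sizeof(demo_map) / sizeof(demo_map[0]),
                   sizeof(demo_map[0]), compare_three_way);
}
```

With the key 0xe5b5 compared against the 0x8eaee7aa entry, `compare_by_subtraction` returns a positive value (claiming the key sorts after the entry) while `compare_three_way` returns -1. A binary search that trusts the wrong sign can step away from the half of the table that holds the key, which would explain why even a small code such as 0xe5b5 can fail to convert.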

BTW, I'm surprised to find this bug only now, since it has been there
since the 7.2 days.

I'm going to commit the fix to both current and 7.3-stable trees.
--
Tatsuo Ishii

> Short Description
> Server-Encoding from EUC_TW to UTF-8 doesn't work
>
> Long Description
> System: SuSE Linux 8.1, kernel 2.4.19, glibc 2.2.5/glibc-locale 2.2.5
> the same error on RedHat 7.3, kernel 2.4.20, glibc2.2.5
> postgresql version 7.3.2
> description: I loaded Chinese (TW) characters, encoded as UTF-8, into a
> database which has UTF-8 encoding, using "copy table from 'original'" in psql. Ok.
> Then I exited psql and exported PGCLIENTENCODING=EUC_TW.
> I started psql and ran "copy table to 'file.EUC_TW'". Ok.
> If I convert this file to UTF-8 with iconv -f EUC-TW -t UTF-8 file.EUC_TW > file.UTF-8
> then file.UTF-8 looks exactly the same as the original.
> That means PostgreSQL converts from UTF-8 to EUC_TW correctly.
> Now I load the exported file 'file.EUC_TW' back into the DB:
> "copy table2 from 'file.EUC_TW'". I had not quit psql in between, so
> PGCLIENTENCODING is the same as for "copy to".
> Now I get an error telling me: "copy: line 1, LocalToUtf: could not convert (0xe5b5) EUC_TW to UTF-8" ... and the characters are missing in table2.
>
> Sample Code
> UTF-8:
> 00000000: e795 b6e6 97a5 0ae5 959f e58b 95e4 b8ad
> 00000010: 2ce4 bd86 e69c 89e9 8caf e8aa a40a
>
> EUC_TW as exported from PostgreSQL and not imported:
> 00000000: e5b5 c5ca 0ada f6d9 afc4 e32c c8fe c8b4
> 00000010: f2e3 eba8 0a

*** src/backend/utils/mb/conv.c.orig 2003-04-12 10:03:25.000000000 +0900
--- src/backend/utils/mb/conv.c 2003-04-12 10:16:04.000000000 +0900
***************
*** 313,319 ****

  	v1 = *(unsigned int *) p1;
  	v2 = ((pg_utf_to_local *) p2)->utf;
! 	return (v1 - v2);
  }

  /*
--- 313,319 ----

  	v1 = *(unsigned int *) p1;
  	v2 = ((pg_utf_to_local *) p2)->utf;
! 	return (v1 > v2)?1:((v1 == v2)?0:-1);
  }

  /*
***************
*** 328,334 ****

  	v1 = *(unsigned int *) p1;
  	v2 = ((pg_local_to_utf *) p2)->code;
! 	return (v1 - v2);
  }

  /*
--- 328,334 ----

  	v1 = *(unsigned int *) p1;
  	v2 = ((pg_local_to_utf *) p2)->code;
! 	return (v1 > v2)?1:((v1 == v2)?0:-1);
  }

  /*
