Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

From: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
Date: 2011-06-07 07:36:27
Message-ID: BANLkTimJWsSxko3HU-qsGnNR4Hk8u5eHvA@mail.gmail.com
Lists: pgsql-hackers

Hi Tom,

The issue occurs on Windows.

As the attached failure.out file shows (produced by running failure.sql), we
get the error "ERROR: invalid byte sequence for encoding "UTF8": 0xe59aff".
Note that the byte sequence returned by the database is e5 9a ff, whereas the
actual byte sequence for the wide character '功' is e5 8a 9f.

'功' ==> Unicode character
e5 8a 9f ==> original byte sequence for that character
e5 9a ff ==> result of downcase_truncate_identifier(), which is not a valid
UTF8 representation and is what gets stored in the pg_catalog table
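
The mapping above can be reproduced with a small sketch. It assumes that a single-byte Windows code page such as 1252 is in effect, where 0x8A (Š) lowercases to 0x9A (š) and 0x9F (Ÿ) lowercases to 0xFF (ÿ). Both function names are hypothetical and only model the byte-at-a-time behaviour; this is not the PostgreSQL source.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for tolower() under a Windows-1252 style locale.
 * Only the two high-bit mappings relevant to this report are modelled
 * (an assumption): 0x8A -> 0x9A and 0x9F -> 0xFF.
 */
static unsigned char
cp1252_tolower_sketch(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return (unsigned char) (ch + ('a' - 'A'));
    if (ch == 0x8A)
        return 0x9A;
    if (ch == 0x9F)
        return 0xFF;
    return ch;
}

/*
 * Downcase a buffer one byte at a time, with no knowledge of the
 * multibyte encoding the bytes belong to.
 */
static void
bytewise_downcase_sketch(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = cp1252_tolower_sketch(buf[i]);
}
```

Feeding the three bytes e5 8a 9f through bytewise_downcase_sketch() leaves e5 unchanged and rewrites the two trailing bytes, giving e5 9a ff.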

When the client displays the result, it receives this invalid byte sequence,
which raises the error. Note that UTF8 defines a valid range for each byte of
a character, and these ranges are checked in the pg_utf8_islegal() function.
Here is the relevant code snippet:

==
a = source[2];
if (a < 0x80 || a > 0xBF)
    return false;
==
Note that source[2] = ff, which falls outside the valid range and therefore
makes the sequence an illegal UTF8 character. The original byte, 9f, falls
within the range.
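
To make the range check concrete, here is a simplified sketch for three-byte sequences. The function names are hypothetical, and the sketch deliberately ignores the extra restrictions on the second byte after lead bytes 0xE0 and 0xED, so it is not a substitute for pg_utf8_islegal().

```c
#include <stdbool.h>

/* A UTF-8 continuation byte must lie in the range 0x80..0xBF. */
static bool
is_continuation_byte_sketch(unsigned char b)
{
    return b >= 0x80 && b <= 0xBF;
}

/*
 * Simplified check of a three-byte sequence: the lead byte must have the
 * form 1110xxxx and both trailing bytes must be continuation bytes.
 * Overlong (0xE0) and surrogate (0xED) cases are not handled here.
 */
static bool
three_byte_seq_looks_valid_sketch(const unsigned char *s)
{
    if ((s[0] & 0xF0) != 0xE0)
        return false;
    return is_continuation_byte_sketch(s[1]) &&
           is_continuation_byte_sketch(s[2]);
}
```

With this check, e5 8a 9f passes, while e5 9a ff is rejected because ff is above 0xBF.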

Since we smash the identifier to lower case using the
downcase_truncate_identifier() function, the solution is to make this
function wide-char aware, as the LOWER() function is.
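
For comparison, a more conservative variant is sketched here. It is not the wide-char-aware approach proposed above: it folds only ASCII letters and leaves every byte with the high bit set untouched, so multibyte characters are never lowercased but their bytes also cannot be corrupted. The function name is hypothetical.

```c
#include <stddef.h>

/*
 * Fold only ASCII 'A'..'Z' to lower case. Bytes >= 0x80 are passed
 * through unchanged, so a multibyte UTF-8 sequence stays byte-for-byte
 * identical (and is therefore not case-folded at all).
 */
static void
ascii_only_downcase_sketch(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
    {
        if (buf[i] >= 'A' && buf[i] <= 'Z')
            buf[i] = (unsigned char) (buf[i] + ('a' - 'A'));
    }
}
```

The trade-off is that non-ASCII upper-case letters keep their case, which is why a properly wide-char-aware function is the more complete fix.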

I see some earlier discussion about making downcase_truncate_identifier()
wide-char aware, but it seems to have trailed off without a conclusion.
(http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php)
This invalid byte sequence issue seems more serious, because it might lead
to, e.g., pg_dump failures.

I have tested this on PG9.0 beta4 (one-click installers). BTW, we have
observed the same with earlier versions as well.

Attached is the .sql and its output (run on PG9.0 beta4).

Any thoughts???

Thanks

--
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Attachment Content-Type Size
failure.sql text/x-sql 368 bytes
failure.out application/octet-stream 1.2 KB
