PostgreSQL fails to convert decomposed utf-8 to other encodings

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: PostgreSQL fails to convert decomposed utf-8 to other encodings
Date: 2014-08-06 00:42:09
Message-ID: 53E179E1.3060404@2ndquadrant.com
Lists: pgsql-bugs

There's a bug in encoding conversions from utf-8 to other encodings that
results in corrupt output if decomposed utf-8 is used.

PostgreSQL doesn't normalize UTF-8 to precomposed form before converting,
so decomposed UTF-8 is not handled correctly.

Take á:

regress=> -- Decomposed - 'a' then 'acute'
regress=> SELECT E'\u0061\u0301';
?column?
----------

(1 row)

regress=> -- Precomposed - 'a-acute'
regress=> SELECT E'\u00E1';
?column?
----------
á
(1 row)

regress=> SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
ERROR: character with byte sequence 0xcc 0x81 in encoding "UTF8" has no
equivalent in encoding "LATIN1"

regress=> SELECT convert_to(E'\u00E1', 'iso-8859-1');
convert_to
------------
\xe1
(1 row)

This affects input from the client too:

regress=> SELECT convert_to('á', 'iso-8859-1');
ERROR: character with byte sequence 0xcc 0x81 in encoding "UTF8" has no
equivalent in encoding "LATIN1"

regress=> SELECT convert_to('á', 'iso-8859-1');
convert_to
------------
\xe1
(1 row)

... yes, that looks like the same function producing different results
on identical input. You might not be able to reproduce with copy and
paste from this mail if your client normalizes UTF-8, but you'll be able
to by printing the decomposed character to your terminal as an escape
string, then copying and pasting from there.
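
To see why the two visually identical literals behave differently, it can
help to look at their bytes. The following is a minimal sketch in Python,
used here only for illustration (it is not part of the original report and
does not touch PostgreSQL). It shows that the decomposed form ends in the
0xcc 0x81 sequence named in the error message:

```python
# Illustration only: compare the UTF-8 bytes of the two forms of "a-acute".
decomposed = "\u0061\u0301"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e1"        # U+00E1 LATIN SMALL LETTER A WITH ACUTE

decomposed_bytes = decomposed.encode("utf-8")
precomposed_bytes = precomposed.encode("utf-8")

print(decomposed_bytes.hex(" "))   # 61 cc 81
print(precomposed_bytes.hex(" "))  # c3 a1
print(decomposed == precomposed)   # False
```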

We probably should have been normalizing decomposed sequences to
precomposed form as part of UTF-8 validation wherever 'text' input occurs,
but it's too late for that now, as DBs in the wild will contain
decomposed chars. Instead, conversion functions need to normalize
decomposed chars to precomposed before converting from UTF-8 to another
encoding.
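
The idea of normalizing before converting can be sketched outside the
server. The following is a minimal Python sketch, assuming NFC is the
intended precomposed form; the helper name convert_to_latin1 is
hypothetical and this is not how PostgreSQL's conversion functions are
implemented:

```python
import unicodedata

def convert_to_latin1(text: str) -> bytes:
    """Hypothetical helper: compose to NFC first, then encode as ISO-8859-1."""
    return unicodedata.normalize("NFC", text).encode("iso-8859-1")

decomposed = "\u0061\u0301"  # 'a' followed by a combining acute accent

# Encoding the decomposed form directly fails, because U+0301 on its own
# has no ISO-8859-1 equivalent.
try:
    decomposed.encode("iso-8859-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

print(direct_ok)                      # False
print(convert_to_latin1(decomposed))  # b'\xe1'
```

Sequences with no precomposed equivalent in the target encoding would still
fail after normalization, so an error path is still needed.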

Comments?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
