Re: Problem with langage encoding

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Stephane Bortzmeyer <bortzmeyer(at)nic(dot)fr>
Cc: Ronald Vyhmeister <rvyhmeister(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Problem with langage encoding
Date: 2008-12-02 00:34:44
Message-ID: 493482A4.8090409@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Stephane Bortzmeyer wrote:
> On Thu, Nov 27, 2008 at 02:34:17AM +0900,
> Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> wrote
> a message of 188 lines which said:
>
>>> ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"
>> Which it does not; that character is "???" (HANGUL SYLLABLE SSYEG)
>
> No, I don't think so. I think that 0xc3bd is the binary value, and
> 0xc3bd in UTF-8 is U+00FD (LATIN SMALL LETTER Y WITH ACUTE) which is
> less surprising. It is in Latin-1 but not in WIN1251.

Yes, that would make a lot more sense. From there it's easy, see below:

> In KOI8-R, FD is the Shcha, a common letter in Russian, so I wonder if
> the database was really in Latin-1.

OK, so what we end up with is the following sequence (expressed in
Python, which is always handy for encoding conversions etc):

print "\xc3\xbd".decode("utf-8").encode("latin-1").decode("koi8_r")

to yield: Щ

Assuming that's right, what presumably happened is that the database was
initialized with UTF-8 and loaded with data it was told was in the
latin-1 encoding, but was actually in the koi8_r encoding. Therefore,
utf-8 sequences were generated based on the interpretation of the bytes
for each koi8_r character as the latin-1 character for the same byte
value, so:

Input file: 0xfd (latin-1: ý, koi8_r: Щ)
|
| (input interpreted as latin-1)
v
Database: 0xc3bd (utf-8: ý)

To recover the data you must reverse that process. Thankfully it's going
to be 100% reversible, ie no information has been lost.

To create a tiny test table for the following explanation and fix code I
just ran:

create table ss ( x text) ;
insert into ss (x) values (E'\xc3\xbd');

Now if I:

set client_encoding = "WIN1251";

I get:

test=> select * from ss;
ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"

just like you.

With client_encoding set as utf-8 (which is what my machine is) I get:

test=> select * from ss;
x
---
ý
(1 row)

PostgreSQL's convert() function:
http://www.postgresql.org/docs/current/static/functions-string.html

may be used now to transform your data. It doesn't assume any encoding
for the input string, unlike convert_from and convert_to, so you can use
a statement like this to convert your data:(where 'x' is the string of
mangled data to be converted):

select (
convert_from(
convert(x::bytea, 'utf-8', 'latin-1'),
'koi8_r')
) from ss;

In other words: "Decode the utf-8 sequence in the input and map each
utf-8 code point to the corresponding latin-1 character, outputting one
byte per latin-1 character. Interpret the sequence of bytes just
produced as a sequence of characters in the koi8_r encoding, and map
them to the same characters in the database's internal encoding."

To copy the converted data to a new table:

CREATE TABLE converted (y text);
INSERT INTO converted(y)
SELECT (
convert_from(
convert(x::bytea, 'utf-8', 'latin-1'),
'koi8_r')
) from ss;

Now if I SELECT from the table of converted data, I get the right(?) output:

test=> select * from converted;
y
---
Щ
(1 row)

You can easily wrap that up in a simple SQL function:

CREATE OR REPLACE FUNCTION fixstr(text) RETURNS text AS $$
SELECT convert_from(convert($1::bytea, 'utf-8', 'latin-1'),'koi8_r')
$$ LANGUAGE 'SQL' IMMUTABLE;

so you can just:

test=> select fixstr(x) from ss;
fixstr
--------
Щ
(1 row)

--
Craig Ringer

In response to

Browse pgsql-general by date

  From Date Subject
Next Message John Smith 2008-12-02 01:37:09 Re: slow, long-running 'commit prepared'
Previous Message Jason Long 2008-12-01 23:45:15 Re: Monty on MySQL 5.1: "Oops, we did it again"