Re: Character Encoding problem

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: Character Encoding problem
Date: 2008-04-07 04:35:37
Message-ID: 47F9A499.3000803@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-jdbc

antony baxter wrote:
> Hi Craig - thanks for replying.
>
>> What's your terminal encoding? The easiest way to find out is to look at
>> the output of the `locale' command.
>
> ant(at)home (/Users/ant) % locale
> LANG="en_GB.UTF-8"

Hmm, ok. I was expecting a non-utf-8 locale.

>
>> What is the output of the psql '\encoding' command when you've just made a
>> new connection, and without touching the client_encoding setting?
>
> testdb=# \encoding
> UTF8

OK, so your psql client encoding matches your db locale and your local 8
bit text codec for your terminal etc.

>> To perform a valid test you'd need to extract the raw byte sequence from
>> the log, encode it as an escaped octal/hex string in psql, and send that.
>
> Ok. I tried cutting and pasting just some cyrillic from my log file
> into 'test.txt' and then did:
>
> ant(at)home (/Users/ant) % file test.txt
> test.txt: UTF-8 Unicode text
>
> Which I assumed was enough?

Only because all your locales are the same. Copy and paste via a
terminal is potentially subject to text encoding conversions by tools
like `less' and `vim'.

A reliable test would be to use something like Perl or Python to open
the file (in plain old 8-bit binary mode, no encoding conversions etc),
seek to the appropriate part, and read the sequence of bytes of interest.

> Ok. Directly after the Java INSERT:
>
> testdb=# select name::bytea from testTable where id = '1';
> first_name
> ------------------------------------------------------------------------------------------------------
> \342\200\223\303\253\342\200\224\303\204\342\200\224\303\247\342\200\223\316\251\342\200\224\303\207
> (1 row)

That decodes as:

–ë—Ä—ç–Ω—Ç

ie emdash, e-umlaut, emdash, captial-a-umlaut, emdash, c-cedelia,
emdash, omega, emdash, capital-c-cedelia.

... which isn't exactly cyrillic ;-)

More to the point, the regularly inserted emdashes are very informative,
as they suggest that something is mangling the UTF-8 escape bytes.

> Then I manually UPDATE and try again:
>
> testdb=# select name::bytea from testTable where id = '1';
> first_name
> ------------------------------------------
> \320\221\321\200\321\215\320\275\321\202

... which decodes as a much more sensible 'Брэнт'. At least it's cyrillic.

>> If you INSERT that string into a text/varchar column in the DB as an
>> escaped string as above, then SELECT it back as bytea, you should get the
>> original sequence of octal values back. If you select it as text, you should
>> get the same value as the string shown above in my email. If you insert the
>> non-escaped string as written in my email and select that back out as bytea,
>> you should get the same sequence of octal values as shown in my mail.
>
> Both appear correct - i.e. SELECT name shows me correctly accented +
> Chinese text, and SELECT name::bytea shows me the same bytes.

Yep, it's now clear that you're getting the right behaviour from psql.
Probably perl, too, since it's the same 8-bit encoding all the way through.

>> Are you explicitly handling conversions from utf-8 byte strings from psql
>> to the local 8 bit encoding, or setting client_encoding to your local 8 bit
>> encoding to get the pg libraries to do it for you?
>
> Neither, as far as I'm aware. I'm running everything (I think) in UTF-8.

It's probably a good idea to set the client encoding explicitly to your
local encoding anyway, or do explicit utf-8 <> local8bit conversions on
data from the DB. Testing your perl code in a non-utf-8 locale

>> It's actually rather a lot more likely that the rest of your tests are
>> inserting mangled data and then retrieving it by unmangling it the same way,
>> and the Java code is getting it right. I see this sort of thing a lot when
>> people are working with 8-bit text of different encodings, or converting
>> between "Unicode" datatypes and 8-bit strings that don't have an inherent
>> encoding.
>
> Yes - I think thats probably right, except that everything else
> *seems* to be functioning correctly...

Yep, my bad. So now we look at your strings in Java. Are you certain
that the input you're feeding to the JDBC is correctly encoded utf-8
text? I've frequently seen a "unicode" string like java.lang.String or
Qt's QString being filled character by character from 8-bit input,
resulting in a hopelessly mangled string for non-ascii input, especially
in locales where one character isn't necessarily one byte. Treating
utf-8 input as ascii or latin-1 (maybe due to an implicit conversion
somewhere) is another culprit.

--
Craig Ringer

In response to

Browse pgsql-jdbc by date

  From Date Subject
Next Message Craig Ringer 2008-04-07 04:36:56 Re: Character Encoding problem
Previous Message Craig Ringer 2008-04-07 04:34:44 Re: Character Encoding problem