Re: Character Encoding problem

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: antony baxter <antony(dot)baxter(at)gmail(dot)com>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: Character Encoding problem
Date: 2008-04-07 03:38:58
Message-ID: 47F99752.9070501@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-jdbc

antony baxter wrote:
> However, if using psql I try to look at the data, it is mangled.
What's your terminal encoding? The easiest way to find out is to look at
the output of the `locale' command.

What is the output of the psql '\encoding' command when you've just made
a new connection, and without touching the client_encoding setting?

If your terminal encoding is not utf-8 then setting client_encoding to
utf-8 in psql is incorrect. You should set it to your local 8 bit
encoding (say, latin-1) so psql knows to convert text between the DB
encoding and local encoding.

I'm a little surprised that psql doesn't set client_encoding by default
based on the LC_ALL / LANG environment variables, but it doesn't seem
to. I'm in a UTF-8 locale (en_AU.UTF-8) and even if I run psql as:

LANG=en_AU LC_ALL=en_AU psql

(en_AU is a LATIN-1 locale by default) it still selects a UTF-8 client
encoding. If I spawn a terminal in the en_AU locale and do this I get
mangled text from psql until I set client_encoding manually.

Shouldn't psql set client_encoding based on the user's locale?
> If I
> try a manual UPDATE via psql using the data cut'n'pasted from my log,
> and then look at the data, it reads correctly.
That's not necessarily correct. If your terminal encoding is not UTF-8
then `less' is converting the UTF-8 encoded data in the log to your
locale encoding. Copying and pasting that from a terminal displaying it
with less will copy the /translated/ data, not the original UTF-8 data.

To perform a valid test you'd need to extract the raw byte sequence from
the log, encode it as an escaped octal/hex string in psql, and send that.
> Therefore I know that
> psql is capable of reading and writing UTF8 data correctly.
It is, but I'm not sure that your tests show that. I think you're
silently inserting data in a different (non-UTF-8) encoding into the
database.

If you can compare the byte sequences of the value when you've just set
it from Java to when you've just set it from psql I could tell you for
sure. Casting the text to a bytea should do the job, eg running:

SELECT problem_column::bytea FROM ...

after you insert/update from java, and again after you insert/update
from psql.

Alternately, just insert the UTF-8 string you want as a bytea. Here's an
example string for you (UTF-8 encoded in my mail client):

SELECT 'äëïöüáéíóú©讓您搜尋和瀏覽最新資訊。'::bytea;

the string is aeiou (all with umlaut), aeiou (all with acute), copyright
symbol, then a bunch of simplified chinese characters from google news.
If it doesn't appear that way in your mail client then it's not
honouring the content encoding set in the message, and you need to
manually tell it to handle the message as utf-8.

The resulting string of UTF-8 encoded bytes is:

E'\303\244\303\253\303\257\303\266\303\274\303\241\303\251\303\255\303\263\303\272\302\251\350\256\223\346\202\250\346\220\234\345\260\213\345\222\214\347\200\217\350\246\275\346\234\200\346\226\260\350\263\207\350\250\212\343\200\202'

If you INSERT that string into a text/varchar column in the DB as an
escaped string as above, then SELECT it back as bytea, you should get
the original sequence of octal values back. If you select it as text,
you should get the same value as the string shown above in my email. If
you insert the non-escaped string as written in my email and select that
back out as bytea, you should get the same sequence of octal values as
shown in my mail.
> Also, the
> client application that reads from my database is Perl, and this also
> retrieves mangled data; we've tried writing and reading directly from
> Perl, and in this case reviewing the data in psql looks normal.
>
What locale is your perl app running in?

Are you explicitly handling conversions from utf-8 byte strings from
psql to the local 8 bit encoding, or setting client_encoding to your
local 8 bit encoding to get the pg libraries to do it for you?
> The conclusion I've reached is that Java + JDBC is not actually
> persisting the data in UTF-8; is that correct or am I wildly off base,
> and if it is correct then is there anything I can do about it?!
>
It's actually rather a lot more likely that the rest of your tests are
inserting mangled data and then retrieving it by unmangling it the same
way, and the Java code is getting it right. I see this sort of thing a
lot when people are working with 8-bit text of different encodings, or
converting between "Unicode" datatypes and 8-bit strings that don't have
an inherent encoding.

--
Craig Ringer

In response to

Responses

Browse pgsql-jdbc by date

  From Date Subject
Next Message Craig Ringer 2008-04-07 03:48:54 Re: Character Encoding problem
Previous Message antony baxter 2008-04-07 03:34:01 Re: Character Encoding problem