Re: utf8 database not dumping utf8 characters

From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Matt Williams *EXTERN*" <matt(at)williams-tech(dot)net>, "Steve Crawford" <scrawford(at)pinpointresearch(dot)com>
Cc: <pgsql-admin(at)postgresql(dot)org>
Subject: Re: utf8 database not dumping utf8 characters
Date: 2012-04-10 12:52:45
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Matt Williams wrote:
> I ran it through xxd and the hex-bytes are different than those of the proper utf8 character:
> 03300a0: 7472 c383 c2b6 6d65 7209 3009 5c4e 0931 tr....mer.0.\N.1 (from the dump file)
> ö : c3b6
> ö : c383 c2b6

That looks like your database does not contain what you think it does.

True, there *are* UTF-8 characters in it, but not the ones you want,
even though everything looks OK on the surface.

Imagine this scenario:
- Database server encoding is UTF8
- Database client encoding is LATIN1
- Application feeds UTF-8 into PostgreSQL.

This can easily happen if the locale of the postgres user account
is ISO8859-1 and the application did not set the PGCLIENTENCODING
environment variable.

The Application stores 'trömer', i.e passes the following bytes to PostgreSQL:
74 72 c3 b6 6d 65 72

PostgreSQL client interprets these bytes as LATIN1, i.e. 'trömer'.

This is converted to UTF-8 and stored in the database as
74 72 c3 83 c2 b6 6d 65 72

When the application retrieves the string, it will get back what
it originally stored, and everybody is happy, that is until somebody
looks closer or wonders why full text search isn't working for
German umlauts.

As to fixing the situation (if the above is actually your problem),
dump the database with -E LATIN1, edit the dump, change LATIN1
to UTF8 in the "SET client_encoding" statement and load it again.

Laurenz Albe

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Kevin Keith 2012-04-10 14:20:29
Previous Message robin 2012-04-10 07:19:58 Re: Giant Postgresql Database Cluster Folder