Re: UTF-8 -> ISO8859-1 conversion problem

From: "J(dot) Michael Crawford" <jmichael(at)gwi(dot)net>
To: Cott Lang <cott(at)internetstaff(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: UTF-8 -> ISO8859-1 conversion problem
Date: 2004-10-29 17:19:39
Message-ID: 6.1.1.1.2.20041029130225.03212c18@pop.suscom-maine.net
Lists: pgsql-general

   In my experience, some characters simply refuse to be converted, even
when they appear to be part of the normal 8-bit character set.  We went to
Unicode databases to hold our Latin1 characters because of this.  There was
even a case where the client was cutting and pasting ASCII text into our
database, and it would not accept some of the letters, giving the same
error you reported.

   I'm going to send a more detailed post on the topic, but in general, 
we've found that there are four things that need to be done (three, if 
you're not serving up web pages) for Latin1 characters to work on multiple 
platforms.

   1.  Create the database in Unicode so that it will hold anything you 
throw at it.

   2.  When importing data, set the encoding in the script that loads the 
data, or if there's no script, use the "SET CLIENT_ENCODING TO (encoding)" 
command.  Setting the encoding in a tool like pgManager is not always 
enough.  Use this to be sure.
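
   For a script-driven import, the point of step 2 is that the first
statement the server sees declares the encoding of the bytes that follow.
A minimal sketch of that idea in Java, assuming the load script is held in
a string; the class name, method name, and the "names" table are made up
for illustration and are not part of postgres or the JDBC driver:

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class LoadScript {
    // Accept only plain encoding names such as LATIN1 or UTF8, because the
    // name is concatenated directly into the SQL text.
    private static final Pattern ENCODING_NAME = Pattern.compile("[A-Za-z0-9_-]+");

    private LoadScript() {}

    /** Returns the script with a SET CLIENT_ENCODING statement as its first line. */
    static String withClientEncoding(String script, String encoding) {
        if (encoding == null || !ENCODING_NAME.matcher(encoding).matches()) {
            throw new IllegalArgumentException("Suspicious encoding name: " + encoding);
        }
        return "SET CLIENT_ENCODING TO '" + encoding + "';\n" + script;
    }

    public static void main(String[] args) {
        // \u00EF is character 239 (i with diaeresis), written as an escape so
        // the source file itself stays plain ASCII.
        String script = "INSERT INTO names (name) VALUES ('na\u00EFve');\n";
        System.out.print(withClientEncoding(script, "LATIN1"));
    }
}
```

   The encoding named here has to match the bytes the script file was
actually saved in, not the encoding of the database.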

   3.  When retrieving data in a Java application, the JVM encoding will 
vary from JVM to JVM, and no attempt on our part to change the JVM encoding 
or translate the encoding of the database strings has worked, either to or 
from the database.  We spent weeks going through every permutation of 
getBytes("ISO-8859-1") and related calls we could find, but to no 
avail.  The JVM will tell you it has a new encoding, but postgres will 
return gibberish.  You can translate the bytes, or get a translated string, 
but it's all the same garbage.  The solution: set the client encoding 
manually through a JDBC prepared statement.  Once you set the client 
encoding properly, all seems to be fine:

// dbCon is an open java.sql.Connection
// use a real encoding, either returned from the JVM or explicitly stated
String dbEncoding = "anEncoding";
PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING TO '" + dbEncoding + "'");
statement.execute();
statement.close();
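
   As a rough illustration of the kind of gibberish described in step 3,
here is a small standalone sketch (not code from our application; the
class and method names are made up).  It takes character 239 and shows
what happens when bytes written in one encoding are read back as the
other.  In one direction the original byte is replaced outright, which
would be one reason later getBytes() calls cannot repair the string,
though I can't say for certain this is all that was going on in our case:

```java
import java.nio.charset.StandardCharsets;

// Illustration only; names are hypothetical.
final class EncodingMismatch {
    private EncodingMismatch() {}

    /** Encodes text as UTF-8, then decodes those bytes as if they were ISO-8859-1. */
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Encodes text as ISO-8859-1, then decodes those bytes as if they were UTF-8. */
    static String latin1ReadAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00EF"; // character 239, i with diaeresis

        // One byte (0xEF) in ISO-8859-1, two bytes (0xC3 0xAF) in UTF-8.
        System.out.println(original.getBytes(StandardCharsets.ISO_8859_1).length);
        System.out.println(original.getBytes(StandardCharsets.UTF_8).length);

        // Two Latin1 characters (U+00C3 U+00AF) instead of one.
        System.out.println(utf8ReadAsLatin1(original));

        // A lone 0xEF byte is not valid UTF-8, so the decoder substitutes the
        // replacement character U+FFFD and the original byte is gone.
        System.out.println(latin1ReadAsUtf8(original));
    }
}
```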

   4.  If writing HTML for a web page, make sure the encoding declared by 
the page matches the encoding of the strings you're throwing at it.  So if 
you have a Linux JVM that has a "UTF-8" encoding, the web page will need 
the HTML equivalent:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
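
   One way to keep the two from drifting apart is to derive both the meta
tag and the output bytes from a single charset value.  A minimal sketch,
assuming the page is assembled as a string before being written out; the
PageWriter class is hypothetical and does no HTML escaping of the body:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only.
final class PageWriter {
    private PageWriter() {}

    /** Builds a minimal page whose meta tag names the charset used to encode it. */
    static byte[] render(String body, Charset charset) {
        String charsetName = charset.name().toLowerCase(Locale.ROOT);
        String html = "<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset="
                + charsetName + "\">"
                + "</head><body>" + body + "</body></html>";
        // The same Charset object drives both the declaration and the bytes.
        return html.getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] page = render("na\u00EFve", StandardCharsets.UTF_8);
        System.out.println(page.length + " bytes written as UTF-8");
    }
}
```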

---

   This is likely far more information than you require, but I thought I'd 
add it anyway so that the information is in the archives.  It took us 
months to solve our problem, even with help from the postgres community, so 
I at least want the basics to be posted while I get my act together and 
write something with more detail.

	- Mike


At 12:12 PM 10/29/2004, Cott Lang wrote:
 >ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1
 >
 >Running 7.4.5, I frequently get this error, and ONLY on this particular
 >character despite seeing quite a bit of 8 bit. I don't really follow why
 >it can't be converted, it's the same character (239) in both character
 >sets. Databases are in ISO8859-1, JDBC driver is defaulting to UTF-8.
 >
 >Am I flubbing something up?  I'm probably going to (reluctantly) convert
 >to UTF-8 in the database at some point, but it'd sure be nice if this
 >worked without that. :)
 >
 >thanks!

