Skip site navigation (1) Skip section navigation (2)

Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8) character sets.

From: "J(dot) Michael Crawford" <jmichael(at)gwi(dot)net>
To: pgsql-general(at)postgresql(dot)org, pgsql-jdbc(at)postgresql(dot)org
Subject: Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8) character sets.
Date: 2004-11-08 16:07:34
Message-ID: 6.1.2.0.2.20041108105839.02fdcfe0@pop.suscom-maine.net (view raw or flat)
Thread:
Lists: pgsql-generalpgsql-jdbc
   I am posting this to the General and JDBC groups so that it's in the 
archive -- I know many people have had problems with Latin1 characters, and 
I'd like to share what has worked for us.  If anyone can add this 
information to a more permanent FAQ, I'd be much obliged.

---

Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8) character sets.

   This is a general primer for using postgres with alternate character 
sets.  For those who have done so successfully, the process is obvious in 
hindsight.  For those who haven't gotten it to work yet, it can be very 
complex, especially with web applications where four separate encodings can 
come into play.

   Postgres is quite intelligent in dealing with character sets, but even 
so, it's possible to get into a situation where your application returns 
garbage for Latin1 or other extended characters (n with a tilde, 
etc.).  This is particularly true in Java-based web applications.

   There are four steps to ensuring proper character retrieval, with the 
fourth applying to Java web applications.  They are explained briefly here, 
and then in more detail further down in the document:

   1.  Use a database encoding that will store the appropriate characters.

   2.  Set the client encoding first, before importing or adding data to 
the database.

   3.  Set the client encoding for each session where a user or program 
accesses the database.

   4.  Set the proper character encoding for the web page.

   Do these four steps, and things should work out nicely.  Miss even one 
of them, and your application will mysteriously return goofy extended 
characters.

   For anyone who has already done this, the solution is probably 
obvious.  However, it took us months to figure this out through 
trial-and-error, even with the help of the Java and Postgres 
communities.  After all, we had five different potential encodings to deal 
with, in four separate steps, and that yields a LOT of non-working 
combinations.

   We found dozens of people with the same problem, but most of them never 
fully resolved their problems.  We tried a heck of a lot of different 
strategies, but the only thing that worked was breaking the situation down 
into these steps, and finding the right encoding combination for each step, 
one-at-a-time.  Now that our application works, we want to help minimize 
the number of people who have to reinvent this wheel from scratch.

   The rest of this document explains the steps in more detail.

---

1.  Create the database with the proper encoding.

   In our experience, the best thing to do is simply choose Unicode as the 
database's character encoding if you think there's a chance of storing 
Latin1 or other characters.

   You could choose Latin1, and this should work in most cases.  However, 
there are times when normal-looking characters refuse to be stored in a 
Latin1 database, such as character 239, which is the same in Latin1 as it 
is is Utf-8.  Rather than attempt to beat our heads against this wall, we 
went with Unicode because it will hold whatever we need to hold.

---

2.  Set client encoding before importing or adding data to the database.

   If you do a mass import from another database, or a data retore from 
Postgres, make sure the encoding is set first.  If you try to restore a 
unicode characters to a Latin1 database, the data likely won't look 
right.  Likewise, if you try to restore Latin1 characters into a Unicode 
database without first telling Postgres that it's getting Latin1 
characters, the extended characters will come back as garbage, 
*even*though* a unicode database can hold them.

   The command to set the client encoding for a session is:

     "SET CLIENT_ENCODING TO Unicode", "SET CLIENT_ENCODING TO Latin1", and 
so forth.

   If you find out what kind of characters are coming into the postgres 
database, and tell Postgres ahead of time, it should be able to 
automatically translate them.  As it is designed to do, of course.  Just 
remember that unicode is a HUGE character set, and there will be unicode 
characters that can't be translated into the much smaller Latin1 character set.

   You may have to experiment to find out what the right client encoding 
is.  It will likely depend upon the database from which you're getting 
data, such as a backup of a Latin1 postgres database, or data retrieved 
from a SQL Server database, which will have its own encoding.  It could 
also depend upon the character set used by your operating system, such as 
the character set of data someone entered into a Microsoft Access database 
(which, if they used the US English Windows, will be cp1252).  Fortunately 
a Unicode or Latin1 client encoding will handle most Windows-related 
encoding issues.

---

3. Set the client encoding for each session where a user or program 
accesses the database.

   If you have users entering data from a Java application, the encoding 
will vary from JVM to JVM.  A windows-based JVM might have an encoding of 
cp1252, while a Linus JVM might be utf-8.  A windows-database application 
will likely ave a cp1252 or Latin1 character set, but that could depend 
upon what application you're using or what version of Windows you're using.

   Postgres needs to know what kind of characters to store and return.  If 
you tell Postgres to return unicode characters to an application that's 
using Latin1, extended characters will look like gibberish.

   Even in Java, where you can do all sorts of character-encoding 
translation, it can be impossible to translate data retrieved from Postgres 
if it's in the wrong encoding.  We've tried changing the JVM encoding, 
altering the jdbc driver, translating encodings on the database read, and 
translating encodings after the read while building a new string, to no 
avail.  We tried 25 combinations of each strategy (five different possible 
read encodings and five different possible string encodings), and nothing 
worked.  We could get an application working in one JVM with one encoding, 
but another JVM would break, and no amount of translation would help.

   But when we finally told Postgres what to return, everythign worked like 
a charm.

   Just as with step two, the key is to use the "SET CLIENT_ENCODING TO 
(encoding)" sql command.  If you're using an application where you can send 
SQL to the server, this is all you need.  In something like MS Access, 
you'll have to move to a passthrough query.  For Java, you'll need to send 
a command through JDBC:

String DBEncoding = "Unicode"  //use a real encoding, either returned from 
the jvm or explicitly stated
PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING 
TO '" + DBEncoding + "'");
statement.execute();

---

4.  Set the proper character encoding for the web page.

   Web browsers also need to know what kind of characters they are 
displaying.  If you send them Unicode characters and the browser thinks 
it's getting Latin1, extended characters will again come up as 
nonsense.  If you're writing a web-based application, it's a good idea to 
include the character set as part of the html, such as:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

   For example, we have an application that serves up pages perfectly from 
a Linux JVM, which has a utf-8 encoding, and serves up pages with goofy 
extended characters from a Windows JVM, which has a cp1252 encoding.  The 
solution is either to translate the strings in the Windows JVM into unicode 
before serving the page, or alternately, tell the web page to use a 
different character set.


---

   Those are all the basics.  Many times you may not have to worry about 
one or more of the steps because your application server or JSP generator 
may already handle the character set conversion.  But if it doesn't, or if 
you're writing Servlets and not JSP's, these steps are good ones to 
consider if you work with Latin1 characters.






In response to

Responses

pgsql-jdbc by date

Next:From: Kris JurkaDate: 2004-11-08 17:15:19
Subject: Re: [JDBC] Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8)
Previous:From: Pallav KalvaDate: 2004-11-08 15:01:04
Subject: Installing pgbench

pgsql-general by date

Next:From: Marc G. FournierDate: 2004-11-08 16:51:50
Subject: Re: RFD: comp.databases.postgresql.general
Previous:From: Gary L. BurnoreDate: 2004-11-08 16:07:07
Subject: Re: RFD: comp.databases.postgresql.general

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group