Re: Problems with charsets, investigated...

From: Alexandre Aufrere <alexandre(dot)aufrere(at)inet6(dot)fr>
To: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: Problems with charsets, investigated...
Date: 2004-08-07 08:41:25
Message-ID: 20040807084125.A6091400E5@smtp.ies.inet6.fr
Lists: pgsql-jdbc
OK, it seems I was really unable to explain my problem...

1) The database's encoding is set to LATIN1 (we use SQL_ASCII nowhere).
2) The JDBC driver requests data from the database in UNICODE (hard-coded
in the driver).
3) Strings coming from the database are therefore UTF-8-encoded. They are
correctly transcoded from LATIN1, as the encoding is correctly specified
in pg_database for that database.
4) Java stores strings internally as UTF-16, but that is only the internal
representation. There seems to be a problem here (see the description of
the work-around below).
5) Java's file.encoding system property is set to ISO-8859-1 (because we
have other data coming from LDAP or the filesystem, which is encoded in
ISO-8859-1 anyway).
6) Our web app chooses to display Java Strings according to
file.encoding, therefore as ISO-8859-1.
7) Bang! Problem: we are now interpreting UTF-8-encoded strings (see
points 2 and 3) as ISO-8859-1.
Therefore all the accented characters come out wrong!
In all previous versions of the JDBC driver (we started with the one
shipped with the PostgreSQL 7.0 series), coupled with the corresponding
version of PostgreSQL, the data was retrieved correctly.
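
To make the symptom in points 2, 3 and 7 concrete, here is a minimal sketch of what happens when UTF-8 bytes are decoded as ISO-8859-1. It is only an illustration: the class and method names are made up, no database or driver is involved, and it assumes a JDK with java.nio.charset.StandardCharsets (Java 7 or later, newer than the JDK 1.4.2 mentioned in this thread).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the JDBC driver.
class MojibakeDemo {
    // Encode a string as UTF-8, then decode those bytes with the wrong
    // charset (ISO-8859-1), which is the misinterpretation from point 7.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00e9t\u00e9";   // the French word for "summer"
        String garbled = misdecode(original);
        // Each accented letter takes two bytes in UTF-8, and each of those
        // bytes becomes its own character when read as ISO-8859-1.
        System.out.println(original.length() + " chars become " + garbled.length());
    }
}
```

Each accented character turns into two unrelated characters, which matches the kind of damage described above.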

Now, a working work-around looks like:
   String correctString =
      new String(stringFromJdbcDriver.getBytes("ISO-8859-1"), "UTF-8");
As I interpret it, the Java-internal transcoding in the driver, from UTF-8
to UTF-16, didn't occur correctly: for some reason the strings were
interpreted as ISO-8859-1 instead of UTF-8, whereas the server was
correctly sending UTF-8/UNICODE strings as requested. Since ISO-8859-1
and UTF-8 are both byte-oriented encodings, and every byte value is valid
in ISO-8859-1, this interpretation is technically possible, but in
practice completely wrong.
This quick work-around is really dirty, and we cannot use it in
production.
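
A small sketch of why the work-around is fragile, under the same assumptions as before (hypothetical class name, Java 7+ for StandardCharsets, no real driver involved): it repairs a string that was mis-decoded as ISO-8859-1, but it damages a string that was already decoded correctly.

```java
import static java.nio.charset.StandardCharsets.ISO_8859_1;
import static java.nio.charset.StandardCharsets.UTF_8;

// Hypothetical demo class wrapping the work-around from this message.
class WorkaroundDemo {
    // Re-encode as ISO-8859-1 to get the original bytes back,
    // then decode those bytes as UTF-8.
    static String workaround(String fromDriver) {
        return new String(fromDriver.getBytes(ISO_8859_1), UTF_8);
    }

    public static void main(String[] args) {
        // A mis-decoded string (UTF-8 bytes C3 A9 read as two Latin-1 chars)
        String garbled = "\u00c3\u00a9";
        System.out.println(workaround(garbled).equals("\u00e9"));  // repaired

        // A correctly decoded string: the single byte E9 is not valid UTF-8,
        // so the decoder substitutes a replacement character.
        String alreadyCorrect = "\u00e9";
        System.out.println(workaround(alreadyCorrect).equals(alreadyCorrect));
    }
}
```

So the work-around is only safe as long as every string really is mis-decoded, which is one reason not to rely on it in production.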
My patch eliminates the problem, because the JDBC driver gets ISO-8859-1
(aka LATIN1) strings from the server, so Java's internal transcoding
into UTF-16 goes OK...

Is there some property/field/parameter somewhere that we didn't set
correctly?

Oh, and server_encoding is set to LATIN1 in the database. Is that wrong
(our data is in LATIN1)? When running queries from command-line psql, we
still get the data correctly, whether we launch it as UTF-8 or
ISO-8859-1: strings always come back in the requested encoding, which
means the server certainly transcodes correctly.

Regards,

Alexandre Aufrere

----------------------------------------------------
From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: alexandre(dot)aufrere(at)inet6(dot)fr
Subject: Re: [JDBC] Problems with charsets, investigated...
Date: Sat, 07 Aug 2004 10:06:30 +1200
> Alexandre Aufrere wrote:
> > Hello,
> > 
> > I am using Postgresql 7.4.2 and its JDBC drivers, straight out from a
> > FC2, along with JDK 1.4.2 from Sun.
> > I use the JDBC driver in a web app using Enhydra appserver. Java
> > correctly sets its file.encoding property to the charset specified in
> > the LANG environment variable. However, it appears that whatever i set
> > this variable to, the JDBC driver seems to use UTF-8.
> 
> This is entirely intentional. See below.
> 
> > I have digged into the code, and seen that in the
> > AbstractJdbc1Connection.java class, the encoding is always forced to
> > "UNICODE" (therefore forcing UTF-8 on Java side).
> > From that, i patched the code to correctly use the file.encoding
> > system property to guess the charset.
> >
> > As i didn't dig very long, and as it seems from what i see in cvsweb
> > at gborg that all this stuff could have changed deeply, i am not sure
> > that this would be useful to you. However i downloaded the latest dev
> > builds at jdbc.postgresql.org, and it seems the bad behaviour is still
> > there.
> >
> > So, did i miss something somewhere ? Are you interested in that
> > (frankly quite ugly) patch ?
> 
> This change doesn't make sense.
> 
> The internal representation of Java strings is UTF-16 always. So it 
> doesn't really matter whether you do:
> 
>    db encoding -> UTF-8 (done by the server)
>    UTF-8 -> UTF-16 Java string (done trivially by the driver)
> 
> or:
> 
>    look up db encoding to know how to transcode
>    db encoding -> UTF-16 Java string (done by the driver)
> 
> other than if you do the second option, you have to do a lot more 
> (unnecessary) work on the driver side. Either way, you still have to 
> somehow transcode the DB data into unicode.
> 
> Using file.encoding as a basis for which encoding to use is horribly 
> broken anyway -- what if that encoding does not match the actual DB 
> charset? Whatever transcoding happens really needs to be done based on 
> the actual DB encoding in use.
> 
> I'd suggest that your real problem is that you do not have your database 
> encoding set correctly. If server_encoding is correct, then the server 
> will do the correct transcoding to UNICODE and everything will be happy 
> -- you will get correctly formed Java strings and can then encode those 
> using whatever output encoding you like. If server_encoding is 
> SQL_ASCII, everything will break horribly as the server has no idea how 
> the raw data is actually encoded and can't transcode.
> 
> If you're exclusively using JDBC to access the database, a UNICODE 
> database encoding is the right choice since it means the server does not 
> need to transcode at all when talking to JDBC. It's probably the right 
> choice even with mixed clients unless you have other clients that don't 
> understand client_encoding.
> 
> This is getting to be a FAQ -- I'm actually looking at disabling support 
> for JDBC access to SQL_ASCII databases entirely since it breaks so 
> unpredictably.
> 
> -O
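
The two routes described in the quoted reply can be sketched as follows. This is an illustration only: the class and method names are hypothetical, the server-side LATIN1 to UTF-8 conversion is imitated in Java instead of being done by PostgreSQL, and StandardCharsets assumes Java 7 or later. Under those assumptions, both routes produce the same Java string.

```java
import static java.nio.charset.StandardCharsets.ISO_8859_1;
import static java.nio.charset.StandardCharsets.UTF_8;

// Hypothetical demo class; the real conversion happens in the server and driver.
class TranscodePaths {
    // Route 1: db encoding -> UTF-8 (normally done by the server),
    // then UTF-8 -> Java string (done by the driver).
    static String viaUtf8(byte[] storedLatin1) {
        // Stand-in for the server-side conversion to the UTF-8 wire format
        byte[] wire = new String(storedLatin1, ISO_8859_1).getBytes(UTF_8);
        return new String(wire, UTF_8);
    }

    // Route 2: the driver decodes the db encoding directly.
    static String direct(byte[] storedLatin1) {
        return new String(storedLatin1, ISO_8859_1);
    }

    public static void main(String[] args) {
        // The French word for "summer" as stored in a LATIN1 database
        byte[] stored = { (byte) 0xE9, 0x74, (byte) 0xE9 };
        System.out.println(viaUtf8(stored).equals(direct(stored)));
    }
}
```

Either way the application ends up with the same String, and which charset to use when writing it out again is a separate choice made at output time.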

