
Re: Problems with charsets, investigated...

From: Alexandre Aufrere <alexandre(dot)aufrere(at)inet6(dot)fr>
To: oliver(at)opencloud(dot)com
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: Problems with charsets, investigated...
Date: 2004-08-08 08:18:54
Message-ID: 20040808081854.793B0400E5@smtp.ies.inet6.fr
Lists: pgsql-jdbc
----------------------------------------------------
From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: alexandre(dot)aufrere(at)inet6(dot)fr
Subject: Re: [JDBC] Problems with charsets, investigated...
Date: Sun, 08 Aug 2004 14:29:15 +1200
> > 5) Java's file.encoding system property is set to ISO-8859-1 (because we
> > have other data coming from LDAP or the filesystem, which is encoded in
> > ISO-8859-1 anyway)
> > 6) Our web app chooses to display Java Strings according to
> > file.encoding, therefore as ISO-8859-1
> > 7) Bing! Problem: we are now interpreting UTF8-encoded strings (see
> > point 2/3) as ISO-8859-1
> > Therefore all the accented characters go wrong!
> 
> This implies that your web app is not transcoding correctly from UTF-16 
> (internal string representation) to ISO-8859-1.

ok, but to test, we simply do a debug output (nothing more than a
System.out) of the strings. normally it's java itself that does the
transcoding there, according to the environment variables, no? moreover,
if we read an ISO-8859-1-encoded string from the filesystem or LDAP, it is
displayed correctly in the debug output.

> How does your web app use file.encoding exactly? Note that the 
> file.encoding property does *not* control the default encoding used by 
> String.getBytes(), as I understand it; the default encoding is 
> JVM-controlled from the system's locale settings.

all system locale settings (i.e. the LANG/LC_* environment variables) are
correctly set to en_US.iso-8859-1. the file.encoding property normally only
reflects that.
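
To check whether file.encoding and the JVM's default charset really agree on a given machine, something like the following can be run in the same environment as the web app. This is only a sketch: the class name DefaultEncodingCheck is hypothetical, and Charset.defaultCharset() assumes JDK 5 or later.

```java
import java.nio.charset.Charset;

final class DefaultEncodingCheck {
    // Reports the file.encoding system property next to the charset the JVM
    // actually resolved as its default, so the two can be compared.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
             + " defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the two values differ, or neither is ISO-8859-1, the locale settings are not reaching the JVM the way they are expected to.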
 
> > In all previous versions of the JDBC driver (we started with the one
> > coming along with postgresql 7.0 series) coupled with the corresponding
> > version of postgresql, the data was correctly retrieved.
> 
> I think this is luck of the draw more than anything..
> 
> > Now, a working work-around looks like:
> > String correctString = new 
> > String(stringFromJdbcDriver.getBytes("ISO-8859-1"),"UTF-8"); 
> 
> This doesn't make sense at all! This means you are interpreting 
> ISO-8859-1 encoded bytes as UTF-8, which is nonsense.

it makes sense if, on input to java, UTF-8 strings were presented to
java as ISO-8859-1: as both use 8-bit code units, any UTF-8 byte sequence
is technically also a valid ISO-8859-1 byte sequence.
for instance, the word 'mère', encoded as UTF-8 and displayed as
ISO-8859-1, will give something like 'mÃ¨re'. then java transcodes that
thing into UTF-16, thinking that it's ISO-8859-1, when it's actually UTF-8.
that ugly work-around simply does the reverse job.
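
The 'mère' case can be reproduced in isolation. This is a minimal sketch, not the driver's actual code path: the class and method names are hypothetical, and StandardCharsets assumes JDK 7 or later. The accented character is written as a \u escape so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Simulates the suspected faulty step: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // The work-around from this thread: recover the raw bytes by encoding as
    // ISO-8859-1, then decode those bytes as the UTF-8 they really are.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "m\u00e8re";          // 4 chars
        String garbled = misdecode(original);   // 5 chars: e-grave became 2 chars
        System.out.println(original.length() + " -> " + garbled.length());
        System.out.println(repair(garbled).equals(original));
    }
}
```

The repair only works because ISO-8859-1 maps every byte value to exactly one character, so the round trip through it loses nothing.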

> > My patch eliminates the problem, because the JDBC driver gets ISO-8859-1
> > (aka LATIN1) strings from the server, therefore java internal transcoding
> > into UTF-16 goes ok...
> 
> It's still the wrong thing to do! I'm sure there is another bug here 
> that is causing the underlying problem. There should be no problem with 
> converting from client_encoding = UNICODE to Java's UTF-16.

yes, i agree it sounds extremely strange! however, the problem still seems
to be there.

> What driver version *exactly* are you using? It's possible that you've 
> hit a driver bug of some sort that is fixed in the current driver 
> (specifically, I think build 302 was broken wrt. UTF-8 conversions -- 
> but it was only available briefly). Have you tried with the current 
> development driver from jdbc.postgresql.org?

as i've said in my first posts, i'm using the driver that comes along with 
FC2, and i've tried all the drivers available on jdbc.postgresql.org 

> Can you show me the code your web app uses to display the Strings it 
> gets from the driver in ISO-8859-1?
> 
> Can you dump out the *characters* of the problem Strings you get from 
> the driver, one character at a time, and see what numeric values you're 
> getting and whether they are the right UTF-16 values you expect? i.e.
> 
>   for (int i = 0; i < str.length(); ++i) {
>    System.out.println(" offset " + i + " value " + (int)str.charAt(i));
>   }
> 
> Can you provide a pg_dump (LATIN1 encoding I assume) plus sample 
> testcase that shows off the problem?

well, i'll investigate more tomorrow, at work, and try to set up a simple
test program to understand better what's going on.
currently, we see the problem by doing a debug output (simply a
System.out) from Enhydra's DODS (which is the object-relational layer).
From what i've seen in DODS (though maybe i didn't dig enough), DODS
does not manipulate Strings coming from the JDBC driver when they are of
type VARCHAR, therefore it shouldn't be the source of the problem.
as for the charAt dump, the values are not correct either, i tried...
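
For reference, this is what the charAt dump would be expected to show for a correctly decoded string and for one whose UTF-8 bytes were read as ISO-8859-1. It is a sketch with a hypothetical class name, and it prints hexadecimal code units rather than the decimal values of the loop quoted above.

```java
final class CharDump {
    // Formats each UTF-16 code unit of s as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Correctly decoded 'mere' with e-grave: one code unit, U+00E8.
        System.out.println(dump("m\u00e8re"));
        // Mis-decoded form: the two UTF-8 bytes C3 A8 appear as two chars.
        System.out.println(dump("m\u00c3\u00a8re"));
    }
}
```

A pair such as U+00C3 U+00A8 where a single U+00E8 is expected means the string was already wrong before any display code touched it.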

Thank you for your advice and time,

Alexandre Aufrere

