Re: Charset encoding and accents

From: Barry Lind <blind(at)xythos(dot)com>
To: Davide Romanini <romaz(at)libero(dot)it>
Cc: pgsql-hackers(at)postgresql(dot)org, PostgreSQL JDBC <pgsql-jdbc(at)postgresql(dot)org>
Subject: Re: Charset encoding and accents
Date: 2003-04-12 02:49:39
Message-ID: 3E977EC3.4090401@xythos.com
Lists: pgsql-hackers pgsql-jdbc

Davide Romanini wrote:
> Barry Lind ha scritto:
>
>> The charSet= option will no longer work with the 7.3 driver talking to
>> a 7.3 server, since character set translation is now performed by the
>> server (for performance reasons) in that scenario.
>>
>> The correct solution here is to convert the database to the proper
>> character set for the data it is storing. SQL_ASCII is not a proper
>> character set for storing 8bit data.
>>
>
> Probably I wasn't clear enough about the problem. I *cannot* change
> the charset type. SQL_ASCII really *is* the proper character set for my
> purposes, because I actually work using psql and the ODBC driver without
> any problem.

You were clear; however, we disagree. SQL_ASCII is *not* the proper
character set for your purposes. The characters you are having problems
with do not exist in the SQL_ASCII character set. The fact that psql
and ODBC work under this misconfiguration doesn't mean that the
configuration is correct. Java deals with all characters internally in
Unicode, which forces a character set conversion. So the code is
converting from SQL_ASCII to UTF8. When it finds characters that are
not part of the SQL_ASCII character set, it doesn't know what to do with
them (are they LATIN1, LATIN5, or some other LATIN? characters?).
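
To illustrate the ambiguity, here is a minimal sketch (not driver code) that decodes the same 8-bit value under two different single-byte character sets. The class name AmbiguousByte and the choice of byte 0xF0 are assumptions made for the example; LATIN1 and LATIN5 correspond to Java's ISO-8859-1 and ISO-8859-9 charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: one raw byte, two possible meanings.
public class AmbiguousByte {
    // Decode a single byte under the given single-byte charset.
    public static String decode(byte b, Charset cs) {
        return new String(new byte[] { b }, cs);
    }

    public static void main(String[] args) {
        byte raw = (byte) 0xF0; // 8-bit value, outside the 7-bit ASCII range
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset latin5 = Charset.forName("ISO-8859-9");

        // Print the Unicode code point each charset assigns to the byte.
        System.out.printf("LATIN1: U+%04X%n", (int) decode(raw, latin1).charAt(0));
        System.out.printf("LATIN5: U+%04X%n", (int) decode(raw, latin5).charAt(0));
    }
}
```

The two lines print different code points (U+00F0 and U+011F), so without knowing which character set the bytes were written in, there is no single correct conversion to Unicode.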

You state that you "*cannot* change" the character set. Can you explain
why this is the case?

> I repeat: psql and ODBC retrieve all data (with the accents) in
> the correct manner. Also, if I change
> org.postgresql.core.Encoding.java so that the decodeUTF8 method simply
> returns a new String(data), JDBC retrieves the data from my SQL_ASCII
> database correctly! So my question is: why does JDBC call the decodeUTF8
> method even when the string is surely *not* a UTF-8 string?

If you were only storing SQL_ASCII characters it would be a UTF8 string
since SQL_ASCII is a subset of UTF8. But since you are storing invalid
SQL_ASCII characters this is no longer true.
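
A small sketch of the subset relationship, assuming for illustration that the accented data was written as LATIN1 (ISO-8859-1); the thread does not say which encoding was actually used. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: compares the byte form of a string in two encodings.
public class AsciiSubset {
    // True when the text is encoded to identical bytes in LATIN1 and UTF-8.
    public static boolean sameBytesInLatin1AndUtf8(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        // Pure 7-bit ASCII text: the bytes are the same in both encodings.
        System.out.println(sameBytesInLatin1AndUtf8("perche"));      // true
        // With an accent: one byte (0xE9) in LATIN1, two bytes (0xC3 0xA9) in UTF-8.
        System.out.println(sameBytesInLatin1AndUtf8("perch\u00e9")); // false
    }
}
```

As long as every stored byte is 7-bit ASCII, passing it through unconverted is harmless. The first accented character breaks that equivalence.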

The logic is as follows:
The driver sets the CLIENT_ENCODING parameter to UNICODE which instructs
the server to convert from the character set of the database to UTF8.

The server then sends all data to the client encoded in UTF8.

The JDBC driver reads the UTF8 data and converts it to Java's internal
Unicode representation.

The problem in all of this is that the server has decided, as an
optimization, that if the database character set is SQL_ASCII then no
conversion to UTF8 is necessary, since SQL_ASCII is a proper subset of
UTF8. However, when characters that are not SQL_ASCII are stored in the
database (i.e. 8bit characters), this optimization simply sends them
on to the client as if they were valid UTF8 characters (which they are
not). So the client then tries to read what are supposed to be UTF8
characters and fails, because it is receiving non-UTF8 data even though
it asked the server to send it only UTF8 data.
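
The failure on the client side can be reproduced with a strict UTF-8 decoder. This is only a sketch using the standard java.nio.charset API, not the driver's own decodeUTF8 implementation; the class name StrictUtf8 and the sample bytes are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: checks whether a byte array is well-formed UTF-8.
public class StrictUtf8 {
    public static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] ascii = { 'a', 'b', 'c' };                  // 7-bit only
        byte[] utf8Accent = { (byte) 0xC3, (byte) 0xA8 };  // U+00E8 encoded as UTF-8
        byte[] rawAccent = { (byte) 0xE8, ' ', 'u', 'n' }; // 0xE8 passed through unconverted

        System.out.println(isValidUtf8(ascii));      // true
        System.out.println(isValidUtf8(utf8Accent)); // true
        System.out.println(isValidUtf8(rawAccent));  // false
    }
}
```

In UTF-8 the byte 0xE8 announces a three-byte sequence, so when it is followed by ordinary ASCII the input is malformed and a strict decoder has to reject it.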

> If JDBC could recognize that the string is *not* a UTF-8 string, then it
> would simply return a new String, which is the right thing to do.
> It's obvious that if JDBC receives from the postgresql server a byte array
> representing a non-UTF8 string, and it calls a method that wants as a
> parameter a byte array representing a UTF8 string, then it is a *bug*,
> because for non-UTF8 strings it must return a new String.
>

As stated above, the driver tells the server to send all data as UTF8,
but because of the optimization, the non-SQL_ASCII characters you are
storing are passed through unconverted, which results in non-UTF8 data
being sent to the client.

> I hope I am clear enough this time.

As I said earlier, you were clear the first time. I hope my response
has explained the issues in greater detail.

>
> Sincerely, I'm getting a bit frustrated with the problem, because I have
> projects to do and it prevents me from doing them :-(

I understand that you are frustrated, but frankly I am frustrated too,
because I keep telling you what the solution to your problem is and you
keep ignoring it :-)

thanks,
--Barry
