Re: MULTIBYTE and SQL_ASCII (was Re: [JDBC] Re: A bugwith pgsql 7.1/jdbc and non-ascii (8-bit) chars?)

From: Barry Lind <barry(at)xythos(dot)com>
To: "Peter B(dot) West" <pbwest(at)powerup(dot)com(dot)au>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: MULTIBYTE and SQL_ASCII (was Re: [JDBC] Re: A bugwith pgsql 7.1/jdbc and non-ascii (8-bit) chars?)
Date: 2001-05-08 21:14:46
Message-ID: 3AF861C6.9090705@xythos.com
Lists: pgsql-hackers pgsql-jdbc

Peter B. West wrote:

> I'm not entirely sure of the situation here, although I have been
> reading the thread as it has unwound. Given that I may not understand
> the whole situation, my *philosophical* preference is NOT to build in
> kludges which silently bypass the information which is being passed
> around.
>
> Initially, I was getting wound up about Latin1 imperialism, but I
> realised that, for SQL_ASCII encoding to work in 8-bit environments up
> to now, users must be working in homogeneous encoding environments,
> where 8 bits coming and going will always represent the same character.
> In that case it doesn't matter how the character is represented
> internally as long as the round-trip translation is consistent.
>
> How hard is it to change the single-byte character encoding of a
> database? If that is currently difficult, why not provide a one-off
> upgrade application which does just that, provided it is going from
> SQL_ASCII to a single-byte encoding?

It is currently not possible to change the character encoding of a
database once created. You can specify a character encoding for a newly
created database only if multibyte is enabled. The code hardcodes a
value of 'SQL_ASCII' if multibyte is not enabled. How difficult it
would be to change this functionality is a question more appropriately
answered by others on the list (i.e. I don't know).

>
> Alternatively, add a compile switch that specifies an implicit 8-bit
> encoding in which 8-bit SQL_ASCII values are to be understood? I think
> that the first solution should be as easy to implement, and would be a
> lot cleaner.
>
> Peter
>
I agree that your first suggestion would be more desirable IMHO.

thanks,
--Barry

>
> Barry Lind wrote:
>
>> Tatsuo Ishii wrote:
>>
>>>>> Thus I would be happy if getdatabaseencoding() returned 'UNKNOWN' or
>>>>> something similar when in fact it doesn't know what the encoding is
>>>>> (i.e. when not compiled with multibyte).
>>>>
>>> Is that ok for Java? I thought Java needs to know the encoding
>>> beforehand so that it could convert to/from Unicode.
>>
>> That is actually the original issue that started this thread. If you
>> want the full thread see the jdbc mail archive list. A user was
>> complaining that when running on a database without multibyte enabled,
>> that through psql he could insert and retrieve 8bit characters, but in
>> jdbc the 8bit characters were converted to ?'s.
>>
>> I then explained why this was happening (db returns SQL_ASCII as the db
>> character set when not compiled with multibyte) so that character set is
>> used to convert to unicode.
>>
>> Tom suggested that it would make more sense for jdbc to use LATIN1 when
>> the database reported SQL_ASCII so that most users will see 'correct'
>> behavior in a non multibyte database. Because currently you need to
>> enable multibyte support in order to use 8bit characters with jdbc.
>> Jdbc could easily be changed to treat SQL_ASCII as LATIN1, but I don't
>> feel that is an appropriate solution for the reasons outlined in this
>> thread (thus the suggestions for UNKNOWN, or the ability for the client
>> to determine if multibyte is enabled or not).
>>
>>>> I have a philosophical difference with this: basically, I think that
>>>> since SQL_ASCII is the default value, you probably ought to assume that
>>>> it's not too trustworthy. The software can *never* be said to KNOW what
>>>> the data encoding is; at most it knows what it's been told, and in the
>>>> case of a default it probably hasn't been told anything. I'd argue that
>>>> SQL_ASCII should be interpreted in the way you are saying "UNKNOWN"
>>>> ought to be: ie, it's an unspecified 8-bit encoding (and from there
>>>> it's not much of a jump to deciding to treat it as LATIN1, if you're
>>>> forced to do conversion to Unicode or whatever). Certainly, seeing
>>>> SQL_ASCII from the server is not license to throw away data, which is
>>>> what JDBC is doing now.
>>>>
>>>>> PS. Note that if multibyte is enabled, the functionality that is being
>>>>> complained about here in the jdbc client is apparently ok for the server
>>>>> to do. If you insert a value into a text column on a SQL_ASCII database
>>>>> with multibyte enabled and that value contains 8bit characters, those
>>>>> 8bit characters will be quietly replaced with a dummy character since
>>>>> they are invalid for the SQL_ASCII 7bit character set.
>>>>
>>>> I have not tried it, but if the backend does that then I'd argue that
>>>> that's a bug too.
>>>
>>>
>>> I suspect the JDBC driver is responsible for the problem Barry has
>>> reported (I have tried to reproduce the problem using psql without
>>> success).
>>>
>>> From interfaces/jdbc/org/postgresql/Connection.java:
>>>
>>>> if (dbEncoding.equals("SQL_ASCII")) {
>>>> dbEncoding = "ASCII";
>>>
>>>
>>> BTW, even if the backend behaves like that, I don't think it's a
>>> bug. Since SQL_ASCII is nothing more than an ascii encoding.
>>
>> I believe Tom's point is that if multibyte is not enabled this isn't
>> true, since SQL_ASCII then means whatever character set the client wants
>> to use against the server as the server really doesn't care what single
>> byte data is being inserted/selected from the database.
>>
>>>> To my mind, a MULTIBYTE backend operating in
>>>> SQL_ASCII encoding ought to behave the same as a non-MULTIBYTE backend:
>>>> transparent pass-through of characters with the high bit set. But I'm
>>>> not a multibyte guru. Comments anyone?
>>>
>>>
>>> If you expect that behavior, I think the encoding name 'UNKNOWN' or
>>> something like that seems more appropriate. (SQL_)ASCII is just an
>>> ascii IMHO.
>>
>> I agree.
>
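
For readers following the thread, the behaviour under discussion (8bit
characters turning into ?'s when SQL_ASCII is mapped to Java's "ASCII")
can be sketched in a few lines of Java. This is only an illustration,
not the driver's code: the class SqlAsciiDemo, the method charsetFor and
its treatSqlAsciiAsLatin1 flag are hypothetical names made up for this
example, and the sample bytes assume a client that sent LATIN1 data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not the actual JDBC driver code.
final class SqlAsciiDemo {

    // Maps the encoding name reported by the server to a Java charset.
    // The boolean is an assumed policy switch: false mimics the quoted
    // Connection.java snippet (SQL_ASCII -> ASCII), true mimics the
    // suggestion to treat SQL_ASCII as LATIN1.
    static Charset charsetFor(String dbEncoding, boolean treatSqlAsciiAsLatin1) {
        if (dbEncoding.equals("SQL_ASCII")) {
            return treatSqlAsciiAsLatin1 ? StandardCharsets.ISO_8859_1
                                         : StandardCharsets.US_ASCII;
        }
        if (dbEncoding.equals("LATIN1")) {
            return StandardCharsets.ISO_8859_1;
        }
        throw new IllegalArgumentException("not covered by this sketch: " + dbEncoding);
    }

    // Decodes bytes as they arrive from the server and encodes them again,
    // the way a client would when reading a value and writing it back.
    static byte[] roundTrip(byte[] wire, Charset cs) {
        String decoded = new String(wire, cs); // bytes -> Unicode
        return decoded.getBytes(cs);           // Unicode -> bytes
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9, which is e-acute in LATIN1.
        byte[] wire = { 'c', 'a', 'f', (byte) 0xE9 };

        byte[] viaAscii = roundTrip(wire, charsetFor("SQL_ASCII", false));
        byte[] viaLatin1 = roundTrip(wire, charsetFor("SQL_ASCII", true));

        // The high-bit byte is not valid ASCII, so it is replaced on
        // decode and comes back as '?' on encode.
        System.out.println("ASCII round trip intact:  " + Arrays.equals(wire, viaAscii));
        // LATIN1 maps every byte value to a character, so nothing is lost.
        System.out.println("LATIN1 round trip intact: " + Arrays.equals(wire, viaLatin1));
    }
}
```

With the ASCII mapping the data is silently altered, while the LATIN1
mapping preserves the bytes. That only means the bytes survive the round
trip; whether the characters shown to the user are the right ones still
depends on what encoding the inserting client actually used, which is
the objection raised above to hardwiring LATIN1.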
