Re: Bug or not about ASCII and Multi-Byte character set

From: Marc Herbert <Marc(dot)Herbert(at)emicnetworks(dot)com>
To: pgsql-odbc(at)postgresql(dot)org
Subject: Re: Bug or not about ASCII and Multi-Byte character set
Date: 2005-08-19 18:05:03
Message-ID: 20050819180503.GK16062@emicnetworks.com
Lists: pgsql-odbc

On Fri, Aug 19, 2005 at 04:11:48PM +0200, Andreas Pflug wrote:
> Marc Herbert wrote:
>
> >If SQL_ASCII is/was equivalent to "ignoring encoding", then it
> >looks/looked pretty misnamed!
> >
> It's not. It should be used for ASCII only, but the database system will
> not barf if you offer it a byte with the upper bit set. You're simply on
> your own.

Well, this still looks like what I called an accidental "BINARY/don't
touch it" mode.
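
A minimal sketch of what such a "don't touch it" mode means in
practice, in Python for illustration only. The `stored` bytes stand in
for what a SQL_ASCII database would hand back unchanged; the sample
text and the two client encodings are assumptions chosen for the
example, not taken from the thread.

```python
# Bytes written by a hypothetical client that believed it was sending Latin-1.
stored = "caf\u00e9".encode("latin-1")   # b'caf\xe9': last byte has the upper bit set

# A server in "encoding ignorance" mode hands the same bytes back untouched.
returned = stored

# A second client that also assumes Latin-1 reads the text back correctly.
as_latin1 = returned.decode("latin-1")

# A third client that assumes UTF-8 cannot decode the very same bytes.
try:
    as_utf8 = returned.decode("utf-8")
except UnicodeDecodeError:
    as_utf8 = None
```

The bytes survive the round trip, but their meaning depends entirely
on an agreement between clients that the database never sees.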


> >Encoding ignorance should rather be called SQL_BINARY. A BINARY setting
> >for strings makes sense, just like when transferring text files using
> >FTP: you just don't trust FTP for encodings and use it like a
> >filesystem. BINARY just means that: "don't mess with encodings and
> >let something else deal with the issue".
> >

> No, binary would include 0x00

This seems irrelevant to me, see below.

> which is definitely *not* a character but the string terminator.

Not everyone in the world uses 0x00 as a string terminator. C does,
and so does Postgres, but Java does not, and I don't think database
standards, let alone encoding standards, say anything about this
(please prove me wrong, I'd really like a definitive answer on this).

I just tried to insert a string containing a null character into
hsqldb using JDBC and it worked perfectly fine. The Postgres JDBC
driver is also "strings with null-character"-ready, so this seems to
be only a limitation of Postgres.

By the way many ODBC function calls ask for the length of string
arguments, _optionally_ being SQL_NTS (Null-Terminated String). So it
seems some people here catered for strings with null characters even
in C!
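
The difference between a null-terminated argument and one passed with
an explicit length can be sketched with a C buffer. This uses Python's
`ctypes` rather than a real ODBC call, so it is only an analogy for
SQL_NTS versus a length argument; the sample bytes are an assumption
for the example.

```python
import ctypes

data = b"ab\x00cd"                       # five bytes, one of them 0x00
buf = ctypes.create_string_buffer(data)  # C char array; ctypes appends a final NUL

# Null-terminated reading (the SQL_NTS convention): stops at the first 0x00.
nts_view = ctypes.string_at(buf)

# Length-counted reading: the caller states how many bytes belong to the string.
counted_view = ctypes.string_at(buf, len(data))
```

Here `nts_view` holds only the bytes before the first 0x00, while
`counted_view` holds all five bytes, embedded null included.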

In any case, whether 0x00 is The String Terminator or not is not
relevant to the fact that there was an accidental "BINARY" string
encoding before. If we learn that 0x00 is really The Database String
Terminator, then it can also be interpreted as a terminator even in
"encoding ignorance" mode since it translates into 0x00 for every
known encoding.
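
A quick check of that claim, as a sketch rather than a proof. The list
of encodings is an arbitrary sample chosen for this example. For the
ASCII-compatible encodings tested, the null character encodes to the
single byte 0x00; UTF-16 is included for comparison because it is not
byte-oriented and uses two zero bytes instead.

```python
NUL = "\x00"

# A few ASCII-compatible encodings, picked arbitrarily for this check.
ascii_compatible = ["ascii", "latin-1", "utf-8", "euc_jp"]
encoded = {name: NUL.encode(name) for name in ascii_compatible}

# For comparison: a 16-bit encoding represents U+0000 as two zero bytes.
utf16 = NUL.encode("utf-16-le")
```

Every value in `encoded` is the single byte 0x00, so a byte-wise
terminator check works without knowing which of these encodings is in
use. The same is not true of `utf16`.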

> >I guess some people knew what they did and simply did not mix
> >drivers/apps, or mixed them in a way they mastered and that worked.

> The latter, with the obvious chance to break if the next app accesses
> the data. This is certainly not the design goal of a RDBMS.

There was a time, not so long ago, when everything encoding-related
was under-specified, all software was buggy, etc., so people had to
cope with it. They were probably pleased at that time to have this
accidental "BINARY" workaround available. One can easily understand
that they complain a little bit about the sudden removal of this
workaround and the unplanned migration to The Right Solution.

Of course on the other hand everyone can understand that the Postgres
developers want to get rid of this accidental BINARY string mode, and
that they are free to do what they want.


> >Well, while reading the complaints it seems this BINARY mode was
> >there before (by "accident"?),

> No.

Well, I am still waiting for some proof of the opposite (since this
0x00 stuff does not seem really related to it).

I was just reformulating Tom Lane's "SQL_ASCII ignorance" quote above,
which looked quite informed.

> >PS: BTW "unicode" is not one encoding but many different ones.

> Doesn't matter. Always means the current Unicode for the system: in the
> backend UTF-8, on Win32 UCS16, Linux UCS32 or UTF-8 dependent on
> interface definition.

Interesting. I hope this "current unicode for the system" concept is
well documented, because just saying "unicode" is not clear at all,
even if it is not strictly ambiguous.
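
To make the "many different encodings" point concrete, here is a small
sketch showing one Unicode code point encoded three ways. The sample
character and the little-endian variants are assumptions picked for
the example.

```python
text = "\u00e9"  # one code point, U+00E9 (e with acute accent)

utf8 = text.encode("utf-8")       # variable-width, 8-bit code units
utf16 = text.encode("utf-16-le")  # 16-bit code units, little-endian
utf32 = text.encode("utf-32-le")  # 32-bit code units, little-endian

lengths = (len(utf8), len(utf16), len(utf32))
```

The same character yields three different byte sequences, of 2, 2 and
4 bytes respectively, so "unicode" alone does not tell a client how to
read the bytes.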

Regards,

Marc.
