Re: String encoding during connection "handshake"

From: sulfinu(at)gmail(dot)com
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: String encoding during connection "handshake"
Date: 2007-11-28 18:17:53
Message-ID: 200711282017.53764.sulfinu@gmail.com
Lists: pgsql-hackers

On Wednesday 28 November 2007, Alvaro Herrera wrote:
> sulfinu(at)gmail(dot)com wrote:
> > Martijn,
> >
> > :) don't take it personally, I am just trying to obtain confirmation
> > that I understood the problem well. After all, it's just that C has a
> > very outdated notion of "char"s (and no notion of Unicode). I was naively
> > under the impression that "char"s had evolved in modern C.
>
> This is not the language's fault in any way. We support plenty of
> encodings beyond UTF-8.
Yes, you support (and worry about) encodings simply because of a C limitation
dating from 1974, if I recall correctly...
In Java, for example, a "char" is a very well-defined datum, namely a UTF-16
code unit. In C, by contrast, it can be one character or another (or an error!)
depending on which encoding was used. The only definition that holds up is that
a "char" is a byte. Its interpretation is uncertain and unsafe (see my original
problem).
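
To make the point concrete, here is a minimal sketch in Python (not part of the
original discussion; the helper name `interpret` and the sample bytes are
chosen purely for illustration). It shows the same byte sequence yielding
different characters, or an error, depending on the encoding assumed by the
reader:

```python
def interpret(raw, encoding):
    """Decode raw bytes under the assumed encoding; return None if invalid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

# Hypothetical sample data: the kind of bytes a C char* might point at.
two_bytes = b"\xc3\xa9"
one_byte = b"\xe9"

as_utf8 = interpret(two_bytes, "utf-8")       # one character, U+00E9
as_latin1 = interpret(two_bytes, "latin-1")   # two characters, U+00C3 U+00A9
lone_latin1 = interpret(one_byte, "latin-1")  # one character, U+00E9
lone_utf8 = interpret(one_byte, "utf-8")      # None: not valid UTF-8
```

Without being told the encoding, the receiver cannot choose between these
readings.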

On Wednesday 28 November 2007, Martijn van Oosterhout wrote:
> On Wed, Nov 28, 2007 at 05:54:05PM +0200, sulfinu(at)gmail(dot)com wrote:
> > Regarding the problem of "One True Encoding", the answer seems obvious to
> > me: use only one encoding per database cluster, either UTF-8 or UTF-16 or
> > another Unicode-aware scheme, whichever yields a statistically smaller
> > database for the languages employed by the users in their data. This
> > encoding should be a one time choice! De facto, this is already happening
> > now, because one cannot change collation rules after a cluster has been
> > created.
>
> Umm, each database in a cluster can have a different encoding, so there
> is no such thing as the "cluster's encoding".
I meant that a cluster should have a single encoding that covers the whole
Unicode character set. That would certainly satisfy everybody.

> You can certainly argue
> that it should be a one time choice, but I doubt you'll get people to
> remove the possibilities we have now. In fact, if anything we'd probably
> go the other way, allow you to select the collation on a per
> database/table/column level (SQL compliance requires this).
The collation order is implemented in close relationship with the byte
representation of strings, but conceptually it depends solely on the locale
and has nothing to do with the encoding.
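
As an illustration only, the following Python sketch sorts the same words
delivered in two different encodings. The function `toy_collation_key` is a
hypothetical stand-in for a locale's rules (it ignores accents and case), not
a real locale implementation and not how PostgreSQL collates. Once the bytes
are decoded to text, the ordering no longer depends on how they were encoded:

```python
import unicodedata

def toy_collation_key(text):
    """Hypothetical collation rule: compare without accents or case first,
    then fall back to the original text to break ties."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), text)

def sort_encoded(items, encoding):
    """Decode each byte string, then sort the resulting text by the toy rule."""
    decoded = [raw.decode(encoding) for raw in items]
    return sorted(decoded, key=toy_collation_key)

words = ["zebra", "\u00c9mile", "apple", "eclair"]
from_utf8 = sort_encoded([w.encode("utf-8") for w in words], "utf-8")
from_utf16 = sort_encoded([w.encode("utf-16-le") for w in words], "utf-16-le")

# A plain byte-wise sort, by contrast, places the accented word last.
bytewise_utf8 = sorted(w.encode("utf-8") for w in words)
```

Here `from_utf8` and `from_utf16` are the same list, while `bytewise_utf8`
follows the byte values rather than any language rule.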

> This has nothing to do with C by the way. C has many features that
> allow you to work with different encodings. It just doesn't force you
> to use any particular one.
Yes, my point exactly! C forces you to worry about encoding. I mean, if you're
not an ASCII-only user ;)

Think of it this way: if I give you a Java String you will know exactly what
I meant; if I send you a C char* you don't know what it is in the absence of
extra information - you can even use it as a uint8*, as is actually done
in md5.c.
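
A short Python sketch of why this matters for hashing (the helper `md5_hex`
and the sample password are made up for illustration; this is not the code in
md5.c). MD5 operates on bytes, so the same non-ASCII text hashes differently
depending on the encoding used to turn it into bytes, while pure ASCII text is
unaffected:

```python
import hashlib

def md5_hex(text, encoding):
    """Encode text to bytes with the given encoding, then hash the bytes."""
    return hashlib.md5(text.encode(encoding)).hexdigest()

# Hypothetical non-ASCII password: "p\u00e2t\u00e9".
password = "p\u00e2t\u00e9"
digest_utf8 = md5_hex(password, "utf-8")
digest_latin1 = md5_hex(password, "latin-1")

# An ASCII-only password encodes to identical bytes either way.
ascii_utf8 = md5_hex("secret", "utf-8")
ascii_latin1 = md5_hex("secret", "latin-1")
```

Client and server therefore have to agree on the encoding before comparing
digests of non-ASCII strings.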

I consider this matter closed from my point of view and I have modified the
JDBC driver according to my needs.
Thank you all for the help.
