Re: String encoding during connection "handshake"

From: sulfinu(at)gmail(dot)com
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: String encoding during connection "handshake"
Date: 2007-11-28 18:17:53
Message-ID: 200711282017.53764.sulfinu@gmail.com
Lists: pgsql-hackers

On Wednesday 28 November 2007, Alvaro Herrera wrote:
> sulfinu(at)gmail(dot)com wrote:
> > Martijn,
> >
> > :) don't take it personally, I am just trying to obtain confirmation that
> > I understood the problem well. After all, it's just that C has a very
> > outdated notion of "char"s (and no notion of Unicode). I was naively
> > under the impression that "char"s had evolved in present-day C.
>
> This is not the language's fault in any way.  We support plenty of
> encodings beyond UTF-8.
Yes, you support (and worry about) encodings simply because of a C limitation 
dating from 1974, if I recall correctly...
In Java, for example, a "char" is a very well defined datum, namely a 16-bit
Unicode (UTF-16) code unit. In C, by contrast, it can be one character or
another (or an error!) depending on which encoding was used. The only
definition that holds up is that a "char" is a byte. Its interpretation is
uncertain and unsafe (see my original problem).
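
To make the point about interpretation concrete, here is a minimal Java sketch that decodes the same two bytes under two different charsets. The class name `ByteInterpretation` and the helper `decode` are made up for this illustration. The byte array stands in for what a C `char*` would carry: the meaning only appears once an encoding is chosen.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteInterpretation {
    // Hypothetical helper: interpret raw bytes under the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
        byte[] raw = {(byte) 0xC3, (byte) 0xA9};

        String asUtf8 = decode(raw, StandardCharsets.UTF_8);
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);

        // Same bytes, different text: one character under UTF-8,
        // two characters (U+00C3, U+00A9) under ISO-8859-1.
        System.out.println("UTF-8 length:      " + asUtf8.length());
        System.out.println("ISO-8859-1 length: " + asLatin1.length());
        System.out.println("Same text? " + asUtf8.equals(asLatin1));
    }
}
```

Nothing in the byte array itself says which of the two readings is the intended one; that information has to travel separately.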

On Wednesday 28 November 2007, Martijn van Oosterhout wrote:
> On Wed, Nov 28, 2007 at 05:54:05PM +0200, sulfinu(at)gmail(dot)com wrote:
> > Regarding the problem of "One True Encoding", the answer seems obvious to
> > me: use only one encoding per database cluster, either UTF-8 or UTF-16 or
> > another Unicode-aware scheme, whichever yields a statistically smaller
> > database for the languages employed by the users in their data. This
> > encoding should be a one time choice! De facto, this is already happening
> > now, because one cannot change collation rules after a cluster has been
> > created.
>
> Umm, each database in a cluster can have a different encoding, so there
> is no such thing as the "cluster's encoding". 
What I meant is that a cluster should have a single encoding that covers the
whole Unicode set. That would certainly satisfy everybody.
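
As a rough illustration of the size argument quoted above, the following Java sketch counts the bytes needed for the same text in UTF-8 and in UTF-16. The class `EncodedSize` and its two helpers are hypothetical names, and the two sample strings are arbitrary; real data would need to be measured on a representative sample.

```java
import java.nio.charset.StandardCharsets;

public class EncodedSize {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in UTF-16; the LE variant is used so that
    // no byte-order mark is added to the count.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";              // 5 ASCII letters
        String cjk = "\u65E5\u672C\u8A9E";   // 3 CJK ideographs

        // ASCII text: UTF-8 needs 1 byte per letter, UTF-16 needs 2.
        System.out.println(utf8Bytes(ascii) + " vs " + utf16Bytes(ascii));
        // CJK text: UTF-8 needs 3 bytes per ideograph, UTF-16 needs 2.
        System.out.println(utf8Bytes(cjk) + " vs " + utf16Bytes(cjk));
    }
}
```

Which encoding is smaller therefore depends on the scripts that dominate the stored data.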

> You can certainly argue
> that it should be a one time choice, but I doubt you'll get people to
> remove the possibilities we have now. In fact, if anything we'd probably
> go the other way, allow you to select the collation on a per
> database/table/column level (SQL compliance requires this).
The collation order is implemented in close relationship with the byte
representation of strings, but conceptually it depends solely on the locale
and has nothing to do with the encoding.

> This has nothing to do with C by the way. C has many features that
> allow you to work with different encodings. It just doesn't force you
> to use any particular one.
Yes, my point exactly! C forces you to worry about encoding. I mean, if you're 
not an ASCII-only user ;)

Think of it this way: if I give you a Java String you will know exactly what
I meant; if I send you a C char* you don't know what it is in the absence of
extra information - you can even use it as a uint8*, as is actually done
in md5.c.
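
A small Java sketch of why this matters for hashing: a digest is computed over bytes, so the same String hashed under two encodings gives two different results as soon as it contains a non-ASCII character. This is an illustration only, not the code from md5.c or from the JDBC driver; the class `DigestDependsOnBytes` and the helper `md5Hex` are invented names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestDependsOnBytes {
    // Hypothetical helper: MD5 of the text's bytes in the given charset,
    // rendered as lowercase hex.
    static String md5Hex(String text, Charset charset) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(charset));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String accented = "\u00E9"; // e with acute accent

        // ASCII-only text encodes to the same bytes in both charsets.
        System.out.println(md5Hex(ascii, StandardCharsets.UTF_8)
                .equals(md5Hex(ascii, StandardCharsets.ISO_8859_1)));
        // Non-ASCII text encodes differently, so the digests differ.
        System.out.println(md5Hex(accented, StandardCharsets.UTF_8)
                .equals(md5Hex(accented, StandardCharsets.ISO_8859_1)));
    }
}
```

Two parties that hash the "same" string but disagree on the encoding will only notice the mismatch once a non-ASCII character shows up.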

I consider this matter closed from my point of view and I have modified the 
JDBC driver according to my needs.
Thank you all for the help.
