Re: String encoding during connection "handshake"

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: <sulfinu(at)gmail(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)alvh(dot)no-ip(dot)org>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: String encoding during connection "handshake"
Date: 2007-11-28 18:57:17
Message-ID: 87prxujj02.fsf@oxford.xeocode.com
Lists: pgsql-hackers

<sulfinu(at)gmail(dot)com> writes:

> Yes, you support (and worry about) encodings simply because of a C limitation
> dating from 1974, if I recall correctly...
> In Java, for example, a "char" is a very well defined datum, namely a Unicode
> point. While in C it can be some char or another (or an error!) depending on
> what encoding was used.

No, you're being confused by C's idiosyncratic terminology. "char" in C just
means a 1-byte integral data type. If you want to store a Unicode code point,
you use a different data type.

Incidentally, I'm not sure, but I don't think it's true that "char" in Java
stores a Unicode code point. I thought Java used UTF-16 internally for strings,
with strings stored as arrays of chars. In that case a "char" in Java stores one
two-byte code unit of a UTF-16 encoded string, which is pretty analogous to
storing UTF-8 encoded strings in C, where each "char" stores one byte of a UTF-8
encoded string.
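
A small sketch to check this, assuming a standard JDK (the class and method
names are made up for illustration). It builds a string holding one code point
outside the Basic Multilingual Plane and compares the number of Java chars with
the number of code points:

```java
public class CodeUnitDemo {
    // Number of Java chars, i.e. UTF-16 code units, in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) does not fit in a single 16-bit char.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(codeUnits(clef));   // 2 chars
        System.out.println(codePoints(clef));  // 1 code point
        // The first char is only half of a surrogate pair, not a character.
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));
    }
}
```

So a single Java "char" can be a fragment of a character, much as a single C
"char" can be a fragment of a UTF-8 sequence.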

> Think of it this way: if I give you a Java String you will perfectly know what
> I meant; if I send you a C char* you don't know what it is in the absence of
> extra information - you can even use it as a uint8*, as it is actually done
> in md5.c.

That's because you're comparing apples to oranges. In C you don't even know if
a char* is a string at all. It's a pointer to some bytes and those could
contain anything.
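
The same holds for a Java byte[], which is the closer analogue of a C char*. A
rough sketch, assuming a standard JDK (class and method names are invented for
the example): one buffer holds encoded text, the other holds an MD5 digest, and
nothing in the type tells them apart.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BytesDemo {
    // Returns true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Both values have the same type; only one of them is text.
        byte[] text = "handshake".getBytes(StandardCharsets.UTF_8);
        byte[] digest = MessageDigest.getInstance("MD5").digest(new byte[0]);

        System.out.println(isValidUtf8(text));    // true
        System.out.println(isValidUtf8(digest));  // false: digest starts 0xD4 0x1D
    }
}
```

Passing the validity check still does not prove the bytes were meant as text;
it only rules out some inputs.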

And think about what happens in Java if you have to deal with UTF-8 encoded
strings or Big5 encoded strings. They aren't "strings" in the Java object
hierarchy, so when someone passes you a "MyString" you have the same problem
of needing to know what encoding was used. Presumably you would put that in a
member variable of the MyString class, but that just comes down to how the data
structures in C are laid out and what you're considering "extra information".
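
A minimal sketch of such a class, assuming a standard JDK. "MyString" is the
hypothetical name used above, and the field and method names are invented here.
The encoding lives in a member variable, and the same two bytes decode to
different text depending on what it says:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class MyString {
    private final byte[] bytes;      // raw encoded data
    private final Charset encoding;  // the "extra information"

    public MyString(byte[] bytes, Charset encoding) {
        this.bytes = bytes.clone();
        this.encoding = encoding;
    }

    // Interpret the raw bytes according to the recorded encoding.
    public String decode() {
        return new String(bytes, encoding);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xC3, (byte) 0xA9};

        // As UTF-8 the bytes are one character, U+00E9.
        System.out.println(new MyString(raw, StandardCharsets.UTF_8).decode().length());
        // As ISO-8859-1 they are two characters, U+00C3 and U+00A9.
        System.out.println(new MyString(raw, StandardCharsets.ISO_8859_1).decode().length());
    }
}
```

A C struct holding a char* plus an encoding id would carry exactly the same
information.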

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
