Re: Bug in UTF8-Validation Code?

From: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To: "Mark Dilger *EXTERN*" <pgsql(at)markdilger(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 07:40:02
Message-ID: AFCCBB403D7E7A4581E48F20AF3E5DB20203E26F@EXADV1.host.magwien.gv.at
Lists: pgsql-hackers

Mark Dilger wrote:
>> What I suggest (and what Oracle implements, and isn't CHR() and ASCII()
>> partly for Oracle compatibility?) is that CHR() and ASCII()
>> convert between a character (in database encoding) and
>> that database encoding in numeric form.
>
> Looking at Oracle documentation, it appears that you get different
> behavior from CHR(X [USING NCHAR_CS]) depending on whether you call it
> with the argument USING NCHAR_CS. Oracle 9i and higher have an
> additional function called NCHR(X) which is supposed to be the same as
> CHR(X USING NCHAR_CS).
>
> On http://www.oraclehome.co.uk/chr-function.htm it says that "To use
> UTF8, you specify using nchar_cs in the argument list". Does this mean
> that CHR(X) behaves as Tom Lane wants, and NCHR(X) behaves as Albe
> Laurenz wants? Vice versa?

That web page is at best misleading, if not downright wrong.

It's just that an Oracle database has two character sets, a "database
character set" and a "national character set", the latter always being a
UNICODE encoding (the name "national character set" is somewhat
misleading).

This baroque concept dates from the days when nobody had a UNICODE
database but people still wanted to store characters not supported
by the "database character set"; in that case you could define a column
to be in the "national character set".

CHR(n) and CHR(n USING NCHAR_CS) = NCHR(n) are the same function, except
that the first uses the "database character set" and the latter two use
the "national character set".

Nowadays this Oracle concept of a "national character set" is nearly
obsolete; one normally uses a UNICODE "database character set".

Oracle has two things to say about CHR():

"For single-byte character sets, if n > 256, then Oracle Database
returns the binary equivalent of n mod 256. For multibyte character
sets, n must resolve to one entire code point. Invalid code points
are not validated, and the result of specifying invalid code points
is indeterminate."

It seems that Oracle means "encoding" when it says "code point" :^)
We should of course reject invalid arguments!
I don't know if I like this modulus thing for single-byte encodings
or not...
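
If I understand the modulus rule correctly, then in a database with a
single-byte "database character set" you would get something like this
(the values are just my illustration):

  SELECT CHR(321) FROM dual;  -- 321 mod 256 = 65, so this yields 'A'
  SELECT CHR(65) FROM dual;   -- 'A' as well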

"Use of the CHR function (either with or without the optional USING
NCHAR_CS clause) results in code that is not portable between ASCII-
and EBCDIC-based machine architectures."

There's one thing that strikes me as weird in your implementation:

> pgsql=# select chr(0);
> ERROR: character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"

0x00 is a valid UNICODE code point and also a valid UTF-8 character!

To me (maybe only to me) CHR() and ASCII() have always had the look
and feel of "type casts" between "char" and integer, with all the lack
of portability this might imply.
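
So what I would intuitively expect (this is my mental model, not a
description of the current implementation) is a simple round trip:

  SELECT ascii(chr(196));  -- 196, whatever character that byte is in the database encoding
  SELECT chr(ascii('A'));  -- 'A'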

Yours,
Laurenz Albe
