Re: Bug in UTF8-Validation Code?

From:	"Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To:	<andrew(at)supernews(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Bug in UTF8-Validation Code?
Date:	2007-04-03 15:47:27
Message-ID:	AFCCBB403D7E7A4581E48F20AF3E5DB20203DD1F@EXADV1.host.magwien.gv.at
Lists:	pgsql-hackers

Andrew wrote:
>> According to RFC 2279, the Euro,
>> Unicode code point 0x20AC = 0010 0000 1010 1100,
>> will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC.
>>
>> IMHO this is the only good and intuitive way for CHR() and ASCII().
>
> It is beyond ludicrous for functions like chr() or ascii() to
> convert a Euro sign to 0xE282AC rather than 0x20AC. "Intuitive"? There
> is _NO SUCH THING_ as 0xE282AC as a representation of a Unicode character
> - there is either the code point, 0x20AC (which is a _number_), or the
> sequences of _bytes_ that represent that code point in various encodings,
> of which the three-byte sequence 0xE2 0x82 0xAC is the one used in UTF-8.

Yes, 0xE2 0x82 0xAC is the representation in UTF-8, and UTF-8 is the
database encoding in use.
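
To make the relationship between the code point 0x20AC and the byte sequence 0xE2 0x82 0xAC concrete, here is a minimal Python sketch of the three-byte UTF-8 form. The function name utf8_three_byte is made up for illustration; it is not part of PostgreSQL or of this thread.

```python
def utf8_three_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes.

    Hypothetical helper for illustration only.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("not a three-byte UTF-8 code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    # Layout: 1110xxxx 10xxxxxx 10xxxxxx
    first = 0xE0 | (code_point >> 12)            # top 4 bits
    second = 0x80 | ((code_point >> 6) & 0x3F)   # middle 6 bits
    third = 0x80 | (code_point & 0x3F)           # low 6 bits
    return bytes([first, second, third])


euro = utf8_three_byte(0x20AC)
print(euro.hex())  # e282ac
```

The result agrees with Python's built-in encoder, "\u20ac".encode("utf-8").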

> Functions like chr() and ascii() should be dealing with the _number_ of the
> code point, not with its representation in transfer encodings.

I think that we have a fundamental difference of opinion here.

As far as I know, the term "code point" is only used in Unicode and
refers to the first column in the list
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

So, if I understand you correctly, you want CHR() and ASCII()
to convert between characters (in the current database encoding)
and Unicode code points (independent of database encoding).

What I suggest (and what Oracle implements; aren't CHR() and ASCII()
there partly for Oracle compatibility?) is that CHR() and ASCII()
convert between a character (in the database encoding) and the
numeric form of its byte sequence in that encoding.

I think that what you suggest would be a useful function too,
but I certainly wouldn't call such a function ASCII() :^)
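
The two proposals can be put side by side in a short Python sketch. All four function names are hypothetical and only model the semantics under discussion; they are not PostgreSQL or Oracle code, and the "encoding" parameter stands in for the database encoding.

```python
# Proposal A: numbers are Unicode code points, independent of encoding.
def chr_code_point(n: int) -> str:
    return chr(n)


def ascii_code_point(ch: str) -> int:
    return ord(ch)


# Proposal B: numbers are the character's bytes in the database
# encoding, read as one big-endian integer.
def chr_encoded(n: int, encoding: str = "utf-8") -> str:
    length = max(1, (n.bit_length() + 7) // 8)
    return n.to_bytes(length, "big").decode(encoding)


def ascii_encoded(ch: str, encoding: str = "utf-8") -> int:
    return int.from_bytes(ch.encode(encoding), "big")


euro = "\u20ac"
print(hex(ascii_code_point(euro)))             # 0x20ac
print(hex(ascii_encoded(euro)))                # 0xe282ac
print(hex(ascii_encoded(euro, "iso8859-15")))  # 0xa4
```

Under proposal B the number changes with the encoding (0xE282AC in UTF-8, 0xA4 in ISO 8859-15), while under proposal A it is always 0x20AC. For plain ASCII characters both proposals give the same result.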

The current implementation seems closer to my idea of ASCII(),
only incomplete:

test=> select to_hex(ascii('€'));
to_hex
--------
e2
(1 row)
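
The value e2 is the first byte of the Euro sign's UTF-8 encoding, which is why the result looks incomplete. The Python sketch below reproduces that value under the assumption that the function returns only the first byte of the encoded character; first_byte_ascii is a made-up name, and the sketch does not claim to mirror the server's actual code.

```python
def first_byte_ascii(ch: str, encoding: str = "utf-8") -> int:
    """Return only the first byte of the character's encoded form.

    Assumed behaviour, inferred from the query result shown above.
    """
    return ch.encode(encoding)[0]


print(format(first_byte_ascii("\u20ac"), "x"))  # e2
```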

What do others think? Should the argument to CHR() be a Unicode
code point or the numeric representation of the character in the
database encoding?

Yours,
Laurenz Albe

In response to

Re: Bug in UTF8-Validation Code? at 2007-04-03 13:43:08 from Andrew - Supernews

Responses

Re: Bug in UTF8-Validation Code? at 2007-04-03 16:44:36 from Mark Dilger
Re: Bug in UTF8-Validation Code? at 2007-04-04 08:12:35 from Zeugswetter Andreas ADI SD
