Re: Bug in UTF8-Validation Code?

From: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To: "Mark Dilger *EXTERN*" <pgsql(at)markdilger(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-03 09:43:21
Message-ID: AFCCBB403D7E7A4581E48F20AF3E5DB201FF8372@EXADV1.host.magwien.gv.at
Lists: pgsql-hackers

Mark Dilger wrote:
>>> In particular, in UTF8 land I'd have expected the argument of chr()
>>> to be interpreted as a Unicode code point, not as actual UTF8 bytes
>>> with a randomly-chosen endianness.
>>>
>>> Not sure what to do in other multibyte encodings.
>>
>> "Not sure what to do in other multibyte encodings" was pretty much my
>> rationale for this particular behavior.  I standardized on network byte
>> order because there are only two endiannesses to choose from, and the
>> other seems to be a more surprising choice.
> 
> Since chr() is defined in oracle_compat.c, I decided to look 
> at what Oracle might do.  See 
>
> http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions18a.htm
> 
> It looks to me like they are doing the same thing that I did,
> though I don't have Oracle installed anywhere to verify that.
> Is there a difference?

This is Oracle 10.2.0.3.0 ("latest and greatest") with UTF-8 encoding
(actually, Oracle chooses to call this encoding AL32UTF8):

SQL> SELECT ASCII('€') AS DEC,
  2         TO_CHAR(ASCII('€'), 'XXXXXX') AS HEX
  3  FROM DUAL;

       DEC HEX
---------- ----------------------------
  14844588  E282AC

SQL> SELECT CHR(14844588) AS EURO FROM DUAL;

EURO
----
€

I don't see how endianness enters into this at all - isn't that just
the question of how a byte is stored physically?

According to RFC 2279, the Euro,
Unicode code point 0x20AC = 0010 0000 1010 1100,
will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC.
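
The arithmetic above can be checked with a short Python sketch. It is
only an illustration, not Oracle's or PostgreSQL's code: the function
names are made up for this example, and the manual packing covers just
the three-byte UTF-8 form (code points 0x0800 to 0xFFFF) that the Euro
sign falls into.

```python
def encode_three_byte(code_point: int) -> bytes:
    """Pack a code point in 0x0800..0xFFFF into UTF-8 (1110xxxx 10xxxxxx 10xxxxxx)."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch handles the three-byte range only")
    return bytes([
        0xE0 | (code_point >> 12),          # top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # middle 6 bits
        0x80 | (code_point & 0x3F),         # low 6 bits
    ])


def utf8_as_number(code_point: int) -> int:
    """Read the UTF-8 bytes of a character as one number, first byte most significant."""
    return int.from_bytes(chr(code_point).encode("utf-8"), byteorder="big")


def number_to_char(value: int) -> str:
    """Inverse of utf8_as_number: split the number into bytes and decode them as UTF-8."""
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, byteorder="big").decode("utf-8")


EURO = 0x20AC

# The hand-packed bytes agree with Python's own UTF-8 encoder.
assert encode_three_byte(EURO) == chr(EURO).encode("utf-8")

print(encode_three_byte(EURO).hex().upper())  # E282AC
print(utf8_as_number(EURO))                   # 14844588
print(number_to_char(14844588) == chr(EURO))  # True
```

The two helper functions mirror what the ASCII() and CHR() calls in the
session above return for this character: the UTF-8 byte sequence written
as a single number, and back again.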

IMHO this is the only good and intuitive way for CHR() and ASCII().

Yours,
Laurenz Albe
