
Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields

From: Johann Zuschlag <zuschlag2(at)online(dot)de>
To: Dave Page <dpage(at)vale-housing(dot)co(dot)uk>
Cc: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>, pgsql-odbc(at)postgresql(dot)org
Subject: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields
Date: 2006-03-30 19:41:06
Message-ID: 442C3452.5020704@online.de
Lists: pgsql-odbc
Dave Page wrote:
> If 'ö' is 'ö', then isn't the query above mixing single and a multibyte encoding? Ie. It should all be single byte - e.g.
>
> select name from kunde where name >= 'ö' order by name asc;
>
> Or all multibyte (displayed byte by byte) whatever that results in:
>
> s*e*l*e*c*t* *n*a*m*e* *f*r*o*m* *k*u*n*d*e* *w*h*e*r*e* *n*a*m*e* *>*=* *'*ö'*;*
>
> Of course, we all know how well I grok encoding issues :-)
>   
Hi Dave,

I understand. These encoding issues sometimes drive me crazy too. :-)

The problem with UTF-8 is that all ASCII characters are represented by 
one byte, while non-ASCII characters take two to four bytes (German 
umlauts, for example, take two). That's why UTF-8 is called a 
"variable-length multibyte encoding". In a pure two-byte Unicode 
encoding such as UCS-2, every character U+xxxx is represented by two 
bytes (fixed-length multibyte encoding). Strictly speaking, Unicode is 
the character set and UTF-8 is only one of its encodings, so Unicode is 
not equal to UTF-8, even though the PostgreSQL documentation states 
that.
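
To make the byte counts concrete, here is a small Python sketch. It is 
not from the original thread, and the helper names byte_lengths and 
misread_as_latin1 are made up for illustration. It compares the encoded 
length of a few characters in UTF-8 and in 16-bit Unicode (UTF-16 
little endian), and shows what happens when UTF-8 bytes are read as if 
they were ISO-8859-1, which is probably the kind of mix-up Dave 
describes.

```python
# Sketch: how many bytes one character occupies in different encodings.
# The helper names and sample characters are illustrative assumptions.

def byte_lengths(ch):
    """Return the encoded length of ch in UTF-8 and UTF-16 (little endian, no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16-le": len(ch.encode("utf-16-le")),
    }

def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as ISO-8859-1 (a wrong guess)."""
    return text.encode("utf-8").decode("iso-8859-1")

if __name__ == "__main__":
    # 'a', o-umlaut, euro sign, and a character outside the 16-bit range
    for ch in ["a", "\u00f6", "\u20ac", "\U0001F600"]:
        print(f"U+{ord(ch):04X}", byte_lengths(ch))
    # The two UTF-8 bytes of o-umlaut show up as two separate characters.
    print(ascii(misread_as_latin1("\u00f6")))
```

Note that the last sample needs four bytes in both encodings, so even 
the 16-bit form is only fixed-length for characters up to U+FFFF.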

If you like, see: http://www.utf8-chartable.de/ or some explanation at 
http://czyborra.com/utf/

Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian. 
Unfortunately (or fortunately?) Windows seems to use UTF-8 for European 
languages. Hiroshi, can you explain that? I guess the Japanese edition of 
Windows XP uses pure 2-byte Unicode.

I can't say anything about psql. But the new psqlodbc driver 7.03.26X 
seems to handle that situation very well.

So I suppose the test was valid to a certain extent, since the 
characters are handled in this mixed way in Win XP. I still have some 
funny behaviour with Unicode in psql (even after setting LC_COLLATE 
correctly :-) ).

For my production machines I will use ISO-8859-1 (or ISO-8859-15) 
anyway. Then the driver will convert all characters to single bytes, 
avoiding all kinds of problems.
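
As a rough illustration of what "single byte" means here, the following 
Python sketch encodes text in ISO-8859-1 or ISO-8859-15. It says 
nothing about how psqlodbc itself performs the conversion, and the 
function name to_single_byte is hypothetical. The catch is that a 
single-byte charset only covers 256 code points, so some characters 
simply do not fit.

```python
# Sketch: converting text to a single-byte charset.
# to_single_byte is a hypothetical helper, not part of any driver.

def to_single_byte(text, encoding="iso-8859-1"):
    """Encode text in a single-byte charset; return None if a character does not fit."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

if __name__ == "__main__":
    # German umlauts fit into ISO-8859-1, one byte each.
    print(to_single_byte("M\u00fcller"))
    # The euro sign is missing from ISO-8859-1 but present in ISO-8859-15.
    print(to_single_byte("\u20ac"))
    print(to_single_byte("\u20ac", "iso-8859-15"))
```

The euro sign is one practical reason to prefer ISO-8859-15 over 
ISO-8859-1 for German data.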

But feel free to ask me for tests... ;-)

Regards,
Johann

