Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text

From: Bart Samwel <bart(at)samwel(dot)tk>
To: Johann Zuschlag <zuschlag2(at)online(dot)de>
Cc: Dave Page <dpage(at)vale-housing(dot)co(dot)uk>, Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>, pgsql-odbc(at)postgresql(dot)org
Subject: Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date: 2006-03-30 21:36:44
Message-ID: 442C4F6C.2000607@samwel.tk
Lists: pgsql-odbc

Johann Zuschlag wrote:
> The problem with UTF-8 is that all ASCII characters are represented by
> one byte and all non ASCII characters, e.g. German Umlauts, are
> represented by two bytes. That's why UTF-8 is called a "variable-length
> multibyte encoding". In a pure Unicode world, e.g. U+xxxx with two
> bytes, every character is represented by two bytes (fixed-length
> multibyte encoding). So Unicode is not equal to UTF-8, even though the
> PostgreSQL documentation is stating that.

Well, it's even more complicated, because Unicode is not a 16-bit
character set: code points run up to U+10FFFF, which needs 21 bits (32
in practice). There is UTF-8 (variable-length, 8-bit code units), UTF-16
(variable-length, 16-bit code units) and UTF-32 (fixed-length, 32-bit
code units). There is also UCS-2 (fixed-length 16-bit), which is limited
to the 16 bits of the Basic Multilingual Plane, and UCS-4, which is
functionally identical to UTF-32. UTF-8 uses up to 4 bytes per
character, so it is more complete than the purely 16-bit UCS-2. Any of
the variable-length encodings, and the 32-bit UTF-32 and UCS-4
encodings, can represent the whole of the character set. A pure Unicode
world can use any of those encodings, so it's a tradeoff. If you want a
direct relationship between the number of characters in a string and the
number of bytes taken, use a fixed-length encoding. If you want to be
able to encode everything, use a variable-length encoding or a 32-bit
encoding. If you want to use little space for mostly-ASCII text, use
UTF-8. That's it.
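
To make the tradeoff concrete, here is a small Python sketch (not part
of psqlODBC; the sample characters and the helper name encoded_sizes are
just chosen for illustration) that prints how many bytes one character
takes in each encoding:

```python
# Bytes needed per character in the three Unicode encoding forms.
# The "-le" codec variants are used so no byte order mark is counted.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(ch):
    """Return {encoding name: number of bytes} for a single character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

# ASCII letter, German umlaut, euro sign, and a character outside the
# Basic Multilingual Plane (U+1D11E, musical symbol G clef).
for ch in ("A", "\u00e4", "\u20ac", "\U0001D11E"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The umlaut takes 2 bytes in UTF-8, but the euro sign takes 3 and the
non-BMP character takes 4, while UTF-32 always takes 4.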

> Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian.
> Unfortunately (or fortunately?) Windows seems to use UTF-8 for European
> languages. Hiroshi can you explain that? I guess the Japanese edition of
> Windows XP is using pure 2 byte Unicode.

In fact, the Win32 API is UTF-16 even in European languages (it started
out as UCS-2 but became UTF-16 when Unicode grew beyond 16 bits :-) ),
but it provides an 8-bit compatibility interface. Don't know if the
8-bit encoding is UTF-8 or plain 8-bit code pages though.
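
The difference between UCS-2 and UTF-16 is the surrogate pair: a code
point above U+FFFF is split into two 16-bit units. A minimal Python
sketch of that split (the function name utf16_units is made up for this
example, and it does not validate its input):

```python
import struct

def utf16_units(cp):
    """Return the list of 16-bit code units UTF-16 uses for code point cp.

    Sketch only: does not reject the reserved range U+D800..U+DFFF or
    values above U+10FFFF.
    """
    if cp < 0x10000:
        # Inside the Basic Multilingual Plane: one unit, same as UCS-2.
        return [cp]
    v = cp - 0x10000                  # 20 bits remain
    high = 0xD800 + (v >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits -> low surrogate
    return [high, low]

units = utf16_units(0x1D11E)
print([f"{u:04X}" for u in units])

# Cross-check against Python's own UTF-16 encoder (little-endian).
assert struct.pack("<2H", *units) == "\U0001D11E".encode("utf-16-le")
```

UCS-2 has no way to express the second case, which is why it stops at
the Basic Multilingual Plane.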

Reference: http://en.wikipedia.org/wiki/Unicode

Cheers,
Bart
