Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields

From:	"Dave Page" <dpage(at)vale-housing(dot)co(dot)uk>
To:	"Johann Zuschlag" <zuschlag2(at)online(dot)de>
Cc:	"Hiroshi Inoue" <inoue(at)tpf(dot)co(dot)jp>, <pgsql-odbc(at)postgresql(dot)org>
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields
Date:	2006-03-30 19:45:43
Message-ID:	E7F85A1B5FF8D44C8A1AF6885BC9A0E4011C9946@ratbert.vale-housing.co.uk
Lists:	pgsql-odbc

> -----Original Message-----
> From: Johann Zuschlag [mailto:zuschlag2(at)online(dot)de]
> Sent: 30 March 2006 20:41
> To: Dave Page
> Cc: Hiroshi Inoue; pgsql-odbc(at)postgresql(dot)org
> Subject: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields
>
> Dave Page wrote:
> > If 'Ã¶' is 'ö', then isn't the query above mixing a single-byte and a
> > multibyte encoding? I.e. it should all be single-byte, e.g.:
> >
> > select name from kunde where name >= 'ö' order by name asc;
> >
> > Or all multibyte (displayed byte by byte) whatever that results in:
> >
> > s*e*l*e*c*t* *n*a*m*e* *f*r*o*m* *k*u*n*d*e* *w*h*e*r*e* *n*a*m*e*
> > *>*=* *'*Ã¶'*;*
> >
> > Of course, we all know how well I grok encoding issues :-)
> >
> Hi Dave,
>
> I can understand you. These encoding issues sometimes drive me
> crazy too. :-)
>
> The problem with UTF-8 is that all ASCII characters are
> represented by one byte and all non-ASCII characters are
> represented by two or more bytes (German umlauts take two). That's
> why UTF-8 is called a "variable-length multibyte encoding". In a
> fixed-width two-byte Unicode encoding (U+xxxx stored as two bytes,
> i.e. UCS-2), every character is represented by two bytes
> (fixed-length multibyte encoding). So Unicode is not equal to UTF-8,
> even though the PostgreSQL documentation states that.
>
> If you like, see: http://www.utf8-chartable.de/ or some
> explanation at http://czyborra.com/utf/

Ahh, thanks for the explanation.
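
The byte counts in the quoted explanation, and the 'Ã¶' in the query, can be checked with a short Python sketch. The sample characters are picked for illustration only, and "utf-16-le" stands in for the two-byte encoding discussed above; none of this is taken from the driver or the server.

```python
# Illustrative only: sample characters chosen for this sketch.
samples = ["o", "ö", "€"]

def byte_lengths(ch):
    """Return the encoded length of ch in UTF-8 and in UTF-16 (little endian, no BOM)."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    utf8_len, utf16_len = byte_lengths(ch)
    print(f"{ch!r}: UTF-8 = {utf8_len} byte(s), UTF-16 = {utf16_len} bytes")

# The UTF-8 bytes of 'ö' (0xC3 0xB6), read back as ISO-8859-1,
# become the two characters that appear in the query.
mojibake = "ö".encode("utf-8").decode("iso-8859-1")
print(mojibake)
```

'o' takes one byte in UTF-8 and 'ö' takes two, while '€' takes three, so the UTF-8 length depends on the character. All three take two bytes in the two-byte form.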

> Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian.
> Unfortunately (or fortunately?) Windows seems to use UTF-8
> for European languages. Hiroshi, can you explain that? I guess
> the Japanese edition of Windows XP is using pure two-byte Unicode.

Ahh, now I do know that Windows does not fully support UTF-8. That's the very reason why UTF-8 is not supported in PostgreSQL 8.0 on Windows, and why 8.1 and above require conversion routines, added to the server by Magnus Hagander, that convert to UCS2(?) before doing any sorting etc.
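
As a rough illustration of that kind of conversion (not the server's actual routine, which is not shown in this thread), the Python sketch below decodes UTF-8 input into two-byte code units before anything would be compared. The function name and the sample value are hypothetical.

```python
def to_utf16_units(utf8_bytes):
    """Decode UTF-8 input and return its UTF-16 code units as a list of ints."""
    text = utf8_bytes.decode("utf-8")
    raw = text.encode("utf-16-le")  # little endian, no byte-order mark
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

name = "Köln".encode("utf-8")  # hypothetical value for the name column
units = to_utf16_units(name)
print(len(name), len(units))   # 5 bytes in, 4 code units out
print([hex(u) for u in units])
```

After the conversion every character in this sample is one fixed-size unit, which is what a wide-character comparison function would expect to receive.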

> I can't say anything about psql. But the new psqlodbc driver
> 7.03.26X seems to handle that situation very well.
>
> So I suppose the test was valid to a certain extent, since
> the characters are handled in this mixed way in Win XP. I
> still have some funny behaviour with Unicode in psql (even
> after setting LC_COLLATE correctly :-) ).
>
> For my production machines I will use ISO-8859-1 (or
> ISO-8859-15) anyway. Then the driver will convert all characters to
> single byte, avoiding all kinds of problems.
>
> But feel free to ask me for tests... ;-)

I'll need to leave that to Hiroshi - we already know we're past my knowledge on the subject :-)
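
For anyone weighing the ISO-8859-1 / ISO-8859-15 choice from the quoted text, here is a small Python sketch of what fits in one byte in each charset. It only uses Python's own codecs and makes no claim about how the driver converts; the helper name is made up for the example.

```python
def encode_single_byte(text, charset):
    """Encode text in a single-byte charset; return None if a character does not fit."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None

for charset in ("iso-8859-1", "iso-8859-15"):
    for ch in ("ö", "€"):
        print(charset, repr(ch), encode_single_byte(ch, charset))
```

'ö' is the single byte 0xF6 in both charsets. '€' has no slot in ISO-8859-1 but is 0xA4 in ISO-8859-15, which is one practical difference between the two.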

Regards, Dave.
