Re: Unicode support

From: Marko Ristola <Marko(dot)Ristola(at)kolumbus(dot)fi>
To: Marc Herbert <Marc(dot)Herbert(at)emicnetworks(dot)com>, pgsql-odbc(at)postgresql(dot)org
Subject: Re: Unicode support
Date: 2005-09-08 17:22:50
Message-ID: 4320736A.10308@kolumbus.fi
Lists: pgsql-odbc


Marc Herbert wrote:

>Marko Ristola <Marko(dot)Ristola(at)kolumbus(dot)fi> writes:
>
>
>>So I ask you, how you have thought about these things:
>>
>>If I have understood Windows correctly, it uses UCS-2 as it's internal
>>UNICODE
>>character set. Linux prefers into UTF-8.
>>
>>
>
>I am not sure what you mean by "internal UNICODE character set", but I
>understand that Linux does prefer UTF-32, NOT UTF-8 !
>
>
>

If you want the details of the UTF-8 encoding, I recommend reading
the Linux manual page :)

man utf-8

It gives a good explanation of how UTF-8 encodes characters.

UTF-8 uses one to four bytes per character.
It can represent almost all of the world's character sets.
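
As a small illustration of the one-to-four-byte range, here is a sketch in Python (my own choice of language for the example, it is not from the manual page). The sample characters are picked arbitrarily, one for each encoded length.

```python
# Encode a few sample characters as UTF-8 and count the bytes each one needs.
# The samples are arbitrary: ASCII, Latin-1 range, rest of the 16-bit range,
# and one code point above U+FFFF.
samples = ["A", "\u00e4", "\u20ac", "\U0001F600"]


def utf8_length(ch):
    """Return the number of bytes the UTF-8 encoding of ch occupies."""
    return len(ch.encode("utf-8"))


lengths = {ch: utf8_length(ch) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

Plain ASCII stays at one byte, which is why old ASCII files are already valid UTF-8.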

Because the task is so huge, the implementations vary and have
bugs. That is what I read in the Samba filesystem FAQ.

So if you stick with the Windows implementation alone, you will not
notice any bugs, but when you move a file to another operating system,
it might look different :(

UCS-2 is a 16-bit Unicode encoding, and on Windows wchar_t is
16 bits wide. On Linux (glibc), wchar_t is 32 bits wide. According to
the Linux manuals, wchar_t is not the same on all implementations.
The manuals therefore recommend that C programs store UTF-8 strings
in binary files and convert them into the wchar_t type at runtime.
The Java language is another story, though it might have the same
problems. The code point number remains the same, but if you draw
the character in a window with different implementations, you might
get different renderings.
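
To make the "store UTF-8, convert at runtime" idea concrete, here is a minimal sketch in Python using the standard ctypes module to look at the platform's wchar_t. The sample word is only an assumption for illustration, and the sizes mentioned in the comments are what I would expect on common systems, not a guarantee.

```python
import ctypes

# Size of the C wchar_t type on the platform running this script.
# Commonly 4 on Linux (glibc) and 2 on Windows, but it is platform-defined.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# Bytes as they would be stored in a file: the UTF-8 encoding of "äiti".
stored = b"\xc3\xa4iti"

# Convert at runtime: decode the UTF-8 bytes, then fill a wchar_t buffer.
text = stored.decode("utf-8")
wide = ctypes.create_unicode_buffer(text)

print("sizeof(wchar_t) =", wchar_size)
print("UTF-8 bytes:", len(stored), "characters:", len(text))
```

The stored bytes are the same on every platform; only the in-memory wide form depends on how large wchar_t is.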

>On all platforms I had a look at, variable-length encodings are only
>for disk and network, never used in memory.
>
>Don't you agree?
>
>
> locale
LANG=fi_FI.UTF-8@euro
LC_CTYPE="fi_FI.UTF-8@euro"
LC_NUMERIC="fi_FI.UTF-8@euro"
LC_TIME="fi_FI.UTF-8@euro"
LC_COLLATE="fi_FI.UTF-8@euro"
LC_MONETARY="fi_FI.UTF-8@euro"
LC_MESSAGES="fi_FI.UTF-8@euro"
LC_PAPER="fi_FI.UTF-8@euro"
LC_NAME="fi_FI.UTF-8@euro"
LC_ADDRESS="fi_FI.UTF-8@euro"
LC_TELEPHONE="fi_FI.UTF-8@euro"
LC_MEASUREMENT="fi_FI.UTF-8@euro"
LC_IDENTIFICATION="fi_FI.UTF-8@euro"
LC_ALL=

So UTF-8 is used a lot under Linux nowadays.
Just as Windows recommends that everybody move to the
native Windows Unicode characters (UCS-2, extended to UTF-16 in
current Windows versions), under Linux it is recommended to move
to UTF-8. Both are Unicode character encodings. UCS-2 is just
simpler: each character is a single 16-bit integer holding its
numerical value.
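
A rough sketch of what "just an integer" means, again in Python. I use the "utf-16-be" codec as a stand-in, on the assumption that for characters in the 16-bit range it produces the same code unit UCS-2 would; the helper name utf16_units is made up for this example.

```python
def utf16_units(ch):
    """Return the 16-bit code units of ch in UTF-16 (big-endian, no BOM)."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Inside the 16-bit range: one code unit, equal to the code point itself.
euro = utf16_units("\u20ac")

# Above U+FFFF: a single 16-bit integer is not enough, UTF-16 uses two units.
emoji = utf16_units("\U0001F600")

print([hex(u) for u in euro])
print([hex(u) for u in emoji])
```

So the simplicity holds as long as every character fits into 16 bits.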

The reason for the popularity of UTF-8 under Linux is that each
program needs very little adjustment to move from a LATIN1-style
encoding to UTF-8.
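
Here is a small Python sketch of why the move is cheap. The strings are hypothetical samples: pure ASCII text has identical bytes in both encodings, and only the accented characters grow when converted.

```python
# Pure ASCII text: the bytes are identical in LATIN1 and UTF-8.
plain = "SELECT 1"
same_bytes = plain.encode("latin-1") == plain.encode("utf-8")

# A word with accented letters ("päivä"): one byte per character in LATIN1.
word = "p\u00e4iv\u00e4"
as_latin1 = word.encode("latin-1")

# Converting existing LATIN1 data to UTF-8 is one decode and one encode.
as_utf8 = as_latin1.decode("latin-1").encode("utf-8")

print("ASCII bytes identical:", same_bytes)
print("LATIN1 length:", len(as_latin1), "UTF-8 length:", len(as_utf8))
```

Code that only treats strings as byte sequences ending in a zero byte keeps working, because UTF-8 never uses a zero byte inside a multi-byte character.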

Happy studying about Unicode character sets :)

Regards,
Marko Ristola
