Re: unicode and sorting(at least)

From: Joel <rees(at)ddcom(dot)co(dot)jp>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: unicode and sorting(at least)
Date: 2004-06-25 02:31:14
Message-ID: 20040625112035.A4E0.REES@ddcom.co.jp
Lists: pgsql-general

On Fri, 25 Jun 2004 10:19:05 +0900 (JST)
Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp> wrote

> > All of the ISO 8xxx encodings and LATINX encodings can handle two languages, English and at least one other. Sometimes they can handle several languages besides English, and are actually designed to handle a family of languages.
>
> ISO 8xxx series are not encodings but character sets. For example,
> ISO-8859-1 can be expressed in 8-bit encoding form, it also can be
> expressed in 7-bit encoding form. This is called ISO-2022. I know that
> PostgreSQL treats ISO-8859-1 as an encoding, but that is just shorthand
> for "8-bit encoded ISO-8859-1".
>
> Also, let's not mix together "languages" and "character
> sets". Languages are defined by humans, not by computers, while
> character sets are perfectly definable by computers. A more important
> point is that a language can be expressed in several character
> sets. For example, Japanese can be expressed in EUC-JP, of
> course. It can also be expressed in ASCII by using ROMAJI script.

(Which isn't to say that everyone will find romanized Japanese easy to
read for meaning.)

But we should point out that there are several variations on the
romanization of Japanese (some of which are anything but regular).

> What I want to say here is talking about "languages" is almost
> useless and we have to talk about character sets and encodings.
>
> > The ONLY encodings that can handle a significant number of languages and character sets are the ISO/UTF/UCS series. (UCS is giving way to UTF.) In fact they can handle every human language ever used, plus some esoteric ones postulated, and there is room for future languages.
> >
> > So, for a column to handle multiple languages/character sets, the languages/character sets have to be in the family that the database's encoding was defined for (in postgres currently; choosing encoding down to the column level is available on several databases and is the SQL spec), OR the encoding for the database has to be UTF8 (since we don't have UTF16 or UTF32 available).
> >
> > Right now, the SORTING algorithm and functionality is fixed for the database cluster, which contains databases of any kind of encodings. It really does not do much good to have a different locale than the encoding, except for UTF8, which as an encoding is language/character set neutral, or SQL_ASCII and an ISO8xxx or LatinX encoding. Since a running instance of Postgres can only be connected to one cluster, a database engine has FIXED sorting, no matter what language/character set encoding is chosen for the database.
>
> The sorting order problem is not necessarily limited to the "cluster
> vs. locale" one. My example about ROMAJI above raises another question:
> "How to sort ROMAJI Japanese?" If we regard it as just ASCII strings, we
> could sort it in alphabetical order. But if we regard it as Japanese,
> sorting in alphabetical order is probably not appropriate.

I think we should say that, while there are some contexts in which
ordinary alphabetic order would be okay, there are some, for instance,
in which we'd want to mirror the kana order as much as possible. (Not
exactly a straightforward map-this-code-point-to-this-collation-value
exercise, but should be doable.)
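
To make the kana-order idea concrete, here is a minimal sketch in Python.
It assumes a simplified Hepburn romanization restricted to the basic
unvoiced kana (no voiced sounds, long vowels, small kana, or doubled
consonants), and the names GOJUON and kana_key are hypothetical, chosen
only for this illustration. It splits each word into syllables and ranks
each syllable by its position in the gojuon table, rather than mapping
single code points to collation values.

```python
# Sketch only: simplified Hepburn syllables for the basic kana,
# listed in gojuon (a, ka, sa, ta, na, ha, ma, ya, ra, wa, n) order.
GOJUON = (
    "a i u e o "
    "ka ki ku ke ko "
    "sa shi su se so "
    "ta chi tsu te to "
    "na ni nu ne no "
    "ha hi fu he ho "
    "ma mi mu me mo "
    "ya yu yo "
    "ra ri ru re ro "
    "wa wo n"
).split()

# Rank of each syllable = its index in the gojuon table.
RANK = {syllable: i for i, syllable in enumerate(GOJUON)}
MAX_LEN = max(len(s) for s in GOJUON)


def kana_key(word):
    """Return a sort key that ranks each syllable by its kana position."""
    word = word.lower()
    key = []
    pos = 0
    while pos < len(word):
        # Greedy longest match, so "shi" wins over "s" followed by "hi".
        for length in range(MAX_LEN, 0, -1):
            chunk = word[pos:pos + length]
            if chunk in RANK:
                key.append(RANK[chunk])
                pos += length
                break
        else:
            # Unknown character: sort it after every known syllable.
            key.append(len(GOJUON) + ord(word[pos]))
            pos += 1
    return key


words = ["sakura", "kawa", "inu", "neko", "asa", "tori"]
by_alphabet = sorted(words)              # plain ASCII order
by_kana = sorted(words, key=kana_key)    # approximate kana order
print(by_alphabet)
print(by_kana)
```

In this small sample the two orders differ only in where "neko" lands:
alphabetically it comes before "sakura", but in kana order the na-row
follows the sa-row and ta-row, so it moves to the end. Ambiguous cases
such as a syllabic "n" followed by a vowel are exactly where the
variations in romanization would start to hurt, and this sketch does not
try to resolve them.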

> This example shows that the sorting order should be defined by users or
> applications, not by systems or DBMSs. This is why the SQL standard
> has the "COLLATION" concept, IMO.
> ...

--
Joel <rees(at)ddcom(dot)co(dot)jp>
