Re: [WIP] collation support revisited (phase 1)

From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-22 17:03:26
Message-ID: 488612DE.5060206@sun.com
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Mon, Jul 21, 2008 at 03:15:56AM +0200, Radek Strnad wrote:
>> I was trying to sort out the problem of not creating a new catalog for
>> character sets and I came up with the following ideas. Correct me if my
>> ideas are wrong.
>>
>> Since collation has to have a defined character set.
>
> Not really. AIUI at least glibc and ICU define a collation over all
> possible characters (ie unicode). When you create a locale you take a
> subset and use that. Think about it: if you want to sort strings and
> one of them happens to contain a Chinese character, it can't *fail*.
> Note strcoll() has no error return for unknown characters.

It does.
See http://www.opengroup.org/onlinepubs/009695399/functions/strcoll.html

The strcoll() function may fail if:

[EINVAL]
[CX] The s1 or s2 arguments contain characters outside the domain of
the collating sequence.
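
A minimal sketch of how a caller could detect that failure, assuming the
usual POSIX convention of clearing errno before the call, since strcoll()
reserves no return value for errors. The helper name checked_strcoll is
hypothetical and is not part of the PostgreSQL tree:

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: compare a and b with strcoll() under the current
 * LC_COLLATE locale. On success, store the comparison result in *result
 * and return 0. If strcoll() set errno (for example EINVAL), return that
 * errno value and leave *result untouched.
 */
static int
checked_strcoll(const char *a, const char *b, int *result)
{
    int cmp;

    errno = 0;              /* no return value is reserved for errors */
    cmp = strcoll(a, b);
    if (errno != 0)
        return errno;       /* e.g. EINVAL: character outside the domain */

    *result = cmp;
    return 0;
}
```

Whether EINVAL is ever actually raised depends on the platform; the
specification only says that strcoll() "may fail".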

>> I'm suggesting to use
>> already written infrastructure of encodings and to use list of encodings in
>> chklocale.c. Currently databases are not created with specified character
>> set but with specified encoding. I think instead of pointing a record in
>> collation catalog to another record in character set catalog we might use
>> only name (string) of the encoding.
>
> That's reasonable. From an abstract point of view collations and
> encodings are orthogonal, it's only when you're using POSIX locales
> that there are limitations on how you combine them. I think you can
> assume a collation can handle any characters that can be produced by
> the encoding.

I don't think that is correct. You cannot use a single collation over all of
Unicode. See http://www.unicode.org/reports/tr10/#Common_Misperceptions. The
same characters can be ordered differently in different languages.
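
A rough sketch of how this can be observed from C, assuming the locale
named by the caller is installed on the system. The helper name
compare_in_locale is hypothetical, and note that it changes the
process-wide LC_COLLATE setting and does not restore it:

```c
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: compare a and b under the named locale.
 * Returns 0 and stores -1, 0 or 1 in *result on success.
 * Returns -1 if setlocale() rejects the locale name.
 * Side effect: leaves LC_COLLATE set to locale_name.
 */
static int
compare_in_locale(const char *locale_name, const char *a, const char *b,
                  int *result)
{
    int cmp;

    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;          /* locale not available on this system */

    cmp = strcoll(a, b);
    *result = (cmp > 0) - (cmp < 0);    /* normalize to the sign only */
    return 0;
}
```

For example, comparing "ch" with "h" under the "C" locale puts "ch"
first. Under a Czech locale such as cs_CZ.UTF-8 (assuming it is installed
and its locale data follows Czech conventions, where "ch" is a separate
letter sorted after "h"), the same call would be expected to give the
opposite sign.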

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
