Re: [WIP] collation support revisited (phase 1)

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-12 13:08:43
Message-ID: 20080712130843.GA12026@svana.org
Lists: pgsql-hackers

On Sat, Jul 12, 2008 at 10:02:24AM +0200, Zdenek Kotala wrote:
> Background:
> We specify the encoding in the initdb phase. ANSI specifies repertoire, charset,
> encoding and collation. If I understand it correctly, a charset is a
> subset of a repertoire and specifies the list of allowed characters for
> language->collation. An encoding is a mapping of a character set to a binary format.
> For example, for the Czech alphabet (charset) we have 6 different encodings for
> 8bit ASCII, but on the other side, multiple charsets are specified for UTF8.

Oh, so you're thinking of a charset as a sort of check constraint. If
your locale is Turkish and you have a column marked charset ASCII, then
storing lower('HI') results in an error, because lower('I') there is the
dotless 'ı', which is not an ASCII character.
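
To make the check-constraint reading concrete, here is a minimal sketch in
Python. Both turkish_lower and store_in_charset are hypothetical helpers made
up for illustration (Python's own str.lower() is locale-independent, so the
Turkish rule is spelled out by hand); nothing here is PostgreSQL code.

```python
def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase using the Turkish dotted/dotless i rule."""
    # Turkish maps 'I' -> 'ı' (dotless) and 'İ' -> 'i'; everything else
    # is lowercased the usual way.
    return text.replace("İ", "i").replace("I", "ı").lower()


def store_in_charset(value: str, charset: str = "ascii") -> bytes:
    """Hypothetical column write: reject values outside the column's charset."""
    try:
        return value.encode(charset)
    except UnicodeEncodeError as exc:
        raise ValueError(f"{value!r} is not representable in {charset}") from exc


if __name__ == "__main__":
    print(store_in_charset("HI".lower()))        # plain lowercase fits in ASCII
    try:
        store_in_charset(turkish_lower("HI"))    # 'hı' contains U+0131
    except ValueError as err:
        print("rejected:", err)
```

The point of the sketch is only that the error comes from the charset check,
not from the lowercasing itself.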

A collation must be defined over all possible characters; it can't
depend on the character set. That doesn't mean sorting in en_US must do
something meaningful with Japanese characters, but it does mean it can't
throw an error (the usual procedure is to sort on Unicode code point).
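
As a rough illustration of "defined everywhere, with a code point fallback",
here is a toy sort key in Python. The weight table and the function names are
assumptions invented for this sketch; real collations (glibc locales, ICU) are
far more elaborate.

```python
import string

# Toy weights: letters a-z sort alphabetically, lowercase before uppercase.
_WEIGHTS = {}
for rank, letter in enumerate(string.ascii_lowercase):
    _WEIGHTS[letter] = (rank, 0)
    _WEIGHTS[letter.upper()] = (rank, 1)


def char_key(ch: str) -> tuple:
    """Return a sort key for one character; never raises."""
    if ch in _WEIGHTS:
        rank, case = _WEIGHTS[ch]
        return (0, rank, case)      # characters the collation knows about
    return (1, ord(ch), 0)          # everything else: order by code point


def toy_collation_key(text: str) -> tuple:
    """Total sort key: defined for every string, whatever it contains."""
    return tuple(char_key(ch) for ch in text)


if __name__ == "__main__":
    words = ["日本", "b", "A", "a", "あ"]
    print(sorted(words, key=toy_collation_key))
```

Unknown characters all land after the known ones and are ordered among
themselves by code point, so the sort never fails on unexpected input.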

> I think if we support UTF8 encoding, then it makes sense to create our own
> charsets, because system locales could have a collation defined for that. We
> need conversion only in the case when the client encoding is not compatible with
> the charset and a conversion is not defined.

The problem is that locales in POSIX are defined on an encoding, not a
charset. The locale en_US.UTF-8 doesn't actually sort any differently
than en_US.latin1; it's just that Japanese characters are not
representable in the latter.
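
A small Python sketch of that distinction follows. It does not call the system
locale machinery (which depends on what is installed); plain code point
ordering stands in for the en_US rules, and the helper names are made up for
the example. It only shows that the encoding limits what can be stored, while
the order of strings both encodings can hold stays the same.

```python
def representable(text: str, encoding: str) -> bool:
    """True if every character of text can be stored in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def sorted_via(words, encoding):
    """Round-trip words through an encoding, then sort the decoded strings."""
    stored = [w.encode(encoding) for w in words]
    # Code point order is a stand-in for the locale's collation rules.
    return sorted(b.decode(encoding) for b in stored)


if __name__ == "__main__":
    words = ["zebra", "café", "apple"]
    print(sorted_via(words, "utf-8") == sorted_via(words, "latin-1"))
    print(representable("日本語", "utf-8"), representable("日本語", "latin-1"))
```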

locale-gen can create a locale for any pair of (locale code, encoding);
whether the result is meaningful is another question.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
