Re: [WIP] collation support revisited (phase 1)

From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-12 08:02:24
Message-ID: 48786510.9080502@sun.com
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Zdenek Kotala wrote:
>
>> The example is when you have translation data (vocabulary) in database.
>> But the reason is that ANSI specify (chapter 4.2) charset as a part of
>> string descriptor. See below:
>>
>> — The length or maximum length in characters of the character string type.
>> — The catalog name, schema name, and character set name of the character
>> set of the character string type.
>> — The catalog name, schema name, and collation name of the collation of
>> the character string type.
>
> We already support multiple charsets, and are able to do conversions
> between them. The set of charsets is hardcoded and it's hard to make a
> case that a user needs to create new ones. I concur with Martijn's
> suggestion -- there's no need for this to appear in a system catalog.
>
> Perhaps it could be argued that we need to be able to specify the
> charset a given string is in -- currently all strings are in the server
> encoding (charset) which is fixed at initdb time. Making the system
> support multiple server encodings would be a major undertaking in itself
> and I'm not sure that there's a point.
>

Background:
We specify the encoding at initdb time. ANSI specifies repertoire, charset,
encoding and collation. If I understand it correctly, a charset is a subset of
a repertoire and specifies the list of characters allowed for a language (and
hence for its collation). An encoding is a mapping of a character set to a
binary format. For example, for the Czech alphabet (charset) we have 6 different
8-bit encodings, while on the other side there are multiple charsets defined on
top of UTF8.
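
To make the charset/encoding distinction concrete, here is a minimal Python
sketch (not PostgreSQL code). It encodes one Czech letter with a few 8-bit
encodings and with UTF8. The codec names are Python's own, and the list covers
only the encodings Python ships with, so it is an assumption-laden illustration
rather than the full set of six mentioned above.

```python
# Illustration only: one character of the Czech charset, several encodings.
CZECH_SAMPLE = "š"  # U+0161

# Python codec names for four 8-bit encodings that cover the Czech alphabet.
EIGHT_BIT = ["iso8859_2", "cp1250", "cp852", "mac_latin2"]


def encode_all(text, encodings):
    """Return {encoding name: encoded bytes} for the given text."""
    return {name: text.encode(name) for name in encodings}


encoded = encode_all(CZECH_SAMPLE, EIGHT_BIT + ["utf_8"])
for name, raw in encoded.items():
    # Same character, different binary representation per encoding.
    print(f"{name:12} {raw.hex()}")
```

The character is the same in every row; only the byte sequence changes, which
is why the charset and the encoding can be described separately.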

I think that if we support the UTF8 encoding, then it makes sense to allow
creating our own charsets, because system locales could have a collation
defined for them. A conversion is needed only when the client encoding is not
compatible with the charset and no conversion is defined yet.
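
As a rough sketch of what "compatible" could mean here, the following Python
treats a charset as a plain set of characters and reports which of them a given
client encoding cannot represent. The helper name and the letter list are
hypothetical and chosen only for illustration; this is not how the server
would implement it.

```python
# Hypothetical model: a charset is just a set of allowed characters.
CZECH_LETTERS = set("áčďéěíňóřšťúůýž")


def unrepresentable(charset, client_encoding):
    """Return the characters of charset that client_encoding cannot encode."""
    missing = set()
    for ch in charset:
        try:
            ch.encode(client_encoding)
        except UnicodeEncodeError:
            missing.add(ch)
    return missing


# An empty result means the client encoding is compatible with the charset.
print(sorted(unrepresentable(CZECH_LETTERS, "iso8859_2")))
print(sorted(unrepresentable(CZECH_LETTERS, "latin_1")))
```

With LATIN2 as client encoding nothing is missing, while LATIN1 cannot
represent several Czech letters, so that combination is the case where the
compatibility question actually matters.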

Any comments?

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
