Locale agnostic unicode text

From: Dawid Kuroczko <qnex42(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Locale agnostic unicode text
Date: 2005-01-21 22:14:55
Message-ID: 758d5e7f05012114147573e75e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello!

One of least liked by me features of PostgreSQL is a need to specify
LC_CTYPE an LC_COLLATE at initdb time. Especially if you intend to
put into DB texts in
different languages (say, Polish, French, German and Russian) and use
functions like lower() or ORDER BY these texts. :)

I guess the need to select these two locales at initdb time is to
avoid problems with corrupted indexes (column first indexed
with lower() function when setlocale('C'), then accessed when
setlocale('ru_RU')... oops. etc.). Probably there are more of
those. :)

To solve this I thought about creating function lower(text, locale), say:
lower ('Name', 'pl_PL.utf8');
Simple enough, I used plperl (plperlu actually) to make it happen and
while doing so I've noticed that perl does unicode-lowercasing/uppercasing on
its own accord, doesn't need locales to it, and does it pretty well.
So the resulting function is:

CREATE DOMAIN unitext text;
CREATE OR REPLACE FUNCTION lower(unitext) RETURNS unitext AS $$
utf8::decode($_[0]);
return lc($_[0]);
$$ LANGUAGE plperlu IMMUTABLE;

And it seems to work fine regardless of locale set.

So... I thoght, why not use this unitext to sort texts?
So I've created functions, operators and operator class,
This time setlocale() was needed to get the behaviour
I needed (database initdb'ed to 'C', my order set to 'pl_PL',
or whatever locale I need at given moment).

I've attached a 'draft' of unitext,sql, which should create
unitext datatype capable of sorting according Polish locale.
It does not work as it should and I don't know how to make
it work. For example:

SELECT * FROM uni_tab ORDER BY uni_column;

...sorts according to 'C' (initdb's) locale.

I can force my way by issuing:

SELECT * FROM uni_tab ORDER BY uni_column USING <;

...but I would like to force ORDER BY using operators
provided by me without this 'USING <' clause.

Any hints how to do it?

Regards,
Dawid

PS: I like perl's lc() and uc() behaviour in utf8 mode. I'm thinking about
trying to "port" it from perl source as a C-language extension for PostgreSQL.
What do you think about it?

Attachment Content-Type Size
unitext.sql text/x-sql 4.1 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Devrim GUNDUZ 2005-01-21 23:00:41 Re: [pgsql-hackers] Re: Translations at pgfoundry
Previous Message Jaime Casanova 2005-01-21 21:34:58 TIP9