ORDER BY and Unicode

From: "M(dot) Bastin" <marcbastin(at)mindspring(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: ORDER BY and Unicode
Date: 2004-05-12 12:39:58
Message-ID: a06100515bcc7c7cdeb35@[192.168.0.101]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

There seems to be a big problem with Unicode for
which a solution might already exist. Somebody
had the following problem on another mailing
list. My suggestion is at the bottom of this
message but if another solution already exists
I'd like to hear about it.

The problem is that special characters aren't
treated right under Unicode. Here are a few
examples:

1. "UPPER('é')" doesn't work.
(That's an accented "e" in there if it doesn't
come through in your e-mail application)

The implication of this is that

SELECT ... WHERE UPPER(mycolumn) LIKE UPPER('my search string')

doesn't give the functionality you'd want.
UPPER and LOWER seem to work on ASCII only. The
Greek, French, etc. are out of luck.

2. "ORDER BY mycolumn" gives a wrong sort order.

Uppercase ASCII characters come first, then
lowercase ASCII, then accented characters...
This really isn't what a human would like to see.

I think the two examples above illustrate this
Unicode problem quite well. Is there an existing
solution? If not could we work together on
creating one, as suggested at the very bottom of
this message?

Thanks,

Marc

------------------------------------------------------

You can use the translate function to solve your problem.
<http://www.postgresql.org/docs/7.4/interactive/functions-string.html>

e.g. for the letter "a":

SELECT * FROM mytable ORDER BY
translate(textcolumn, 'àáâäÀÁÂÄa', 'AAAAAAAAA');

Then you build an index like this to speed things up:

CREATE INDEX MyIndex ON MyTable
(translate(textcolumn, 'àáâäÀÁÂÄa', 'AAAAAAAAA'));

And your select queries will also be case and
accent independent from then on e.g:

SELECT * FROM mytable WHERE
translate(textcolumn, 'àáâäÀÁÂÄa', 'AAAAAAAAA')
LIKE translate('my search string', 'àáâäÀÁÂÄa',
'AAAAAAAAA');

Since the alphabet is very long you're better off
creating your own function that does this
translation for the whole alphabet and then you'd
get something like this:

SELECT * FROM mytable WHERE
MySimpleABC(textcolumn) LIKE MySimpleABC('my
search string') ORDER BY MySimpleABC(textcolumn);

Your index would be like this:

CREATE INDEX MyIndex ON MyTable (MySimpleABC(textcolumn));

Cheers,

Marc

PS: Maybe we should work together to create the
mother of all functions that would do this for
all of Unicode? Anybody else up to this?

In response to

Responses

Browse pgsql-novice by date

  From Date Subject
Next Message Stephan Szabo 2004-05-12 13:51:08 Re: ORDER BY and Unicode
Previous Message Sandro Martinez 2004-05-12 07:07:12 ayuda