Re: utf-8 and cultural sensitive sorting

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: alexs(at)advfn(dot)com
Cc: sknipe(at)tucows(dot)com, pgsql-general(at)postgresql(dot)org
Subject: Re: utf-8 and cultural sensitive sorting
Date: 2005-07-13 01:07:28
Message-ID: 20050713.100728.41628839.t-ishii@sra.co.jp
Lists: pgsql-general

> It depends what language you want to sort. Lots of languages do not
> have a sort alphabet. For example, Japanese. It can be quite
> difficult to sort unusual languages like this. I am not aware of any
> standard technique for sorting Japanese text other than keeping an
> arbitrarily sorted dictionary (courtesy of whatever the most popular
> Japanese dictionary at the time happens to be perhaps) and then doing
> hash lookups in it for indexing values. As you can imagine, this is
> not particularly fast. I have not actually tried this, but I expect
> PostgreSQL will simply sort in a fairly binary fashion. As in, it gets
> sorted according to the binary value of the characters, or the
> UTF-8 offsets, or something like that.

The above is mostly correct, but sorting by JIS code order is usually
enough for most Japanese applications (I believe the same can be said
of Chinese). I do not recommend using locale for sorting Japanese: the
locale support for multibyte encodings is quite frequently totally
broken. See the recent posting titled "[GENERAL] Japanese words not
distinguished" for more details.

If you have to live with a UTF-8 database, I recommend turning off
locale support (for example, by running initdb with --no-locale) and
using CONVERT to sort Japanese. For example:

SELECT * FROM t1 ORDER BY CONVERT(col1 USING utf_8_to_euc_jp);
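
To illustrate why the conversion changes the result, here is a minimal
sketch in Python (outside the database) that sorts the same strings by
their UTF-8 bytes and by their EUC-JP bytes. It assumes that, with locale
support off, the server compares values bytewise, so ordering by the
EUC-JP form gives JIS code order. The sample words and the helper names
euc_jp_key and utf8_key are hypothetical and chosen only for illustration.

```python
# Hypothetical sample data: three kanji whose JIS code order differs
# from their Unicode code point order.
words = ["一", "亜", "愛"]


def utf8_key(text: str) -> bytes:
    """Sort key: UTF-8 bytes, which order the same way as code points."""
    return text.encode("utf-8")


def euc_jp_key(text: str) -> bytes:
    """Sort key: EUC-JP bytes, which follow JIS code order."""
    return text.encode("euc_jp")


# Roughly what a bytewise comparison in a UTF-8 database would give.
by_utf8 = sorted(words, key=utf8_key)

# Roughly what ordering by the EUC-JP conversion would give.
by_jis = sorted(words, key=euc_jp_key)

print("UTF-8 byte order:   ", by_utf8)
print("EUC-JP (JIS) order: ", by_jis)
```

In JIS order the kanji come out arranged by reading, which is closer to
what a Japanese reader expects than the radical-based Unicode order.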

> On 12 Jul 2005, at 15:48, <sknipe(at)tucows(dot)com> wrote:
>
> > Our product will be storing its character data in utf-8 format
> > (unicode encoding).
> >
> > What is the best way to achieve cultural sensitive sorting using the
> > utf-8 data?
> >
> > Is it possible to have the locale apply to a connection?
> >
> > If so, is the cultural sorting support mature in PostgreSQL?
> >
> > What type of performance can be expected as compared with the
> > normal c locale sorting?
> >
> > Thanks very much,
> >
> > Steve.
