Re: locale-specific sort algorithms undocumented?

From: Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
To: John Gunther <mail(at)bucksvsbytes(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: locale-specific sort algorithms undocumented?
Date: 2004-07-26 00:18:53
Message-ID: 20040725170328.F42584@megazone.bigpanda.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, 20 Jul 2004, John Gunther wrote:

> In the course of creating a new PG 7.4.3 server, I chose UTF-8 as my
> locale. I now find that sorting is very different with that setting: It

As a note, UTF-8 is not a locale. My guess is that you are actually using
something like en_XX.UTF-8 (for some XX) given your comments below.

> appears, through trial and error, that all non-alphanumeric characters
> are completely ignored by ORDER BY. Thus, traditional tricks like
> prefixing a name with a leading space to force it to the head of an
> alphabetized list don't work anymore. This is unexpected and undesired,
> but curiously, I can't find a word written anywhere that defines the
> sort process for various locale settings. Can anyone point me to any
> documentation on this?

Pretty much we punt to your system documentation, because we rely on the
facilities provided by it for handling the collation.
"PostgreSQL uses the standard ISO C and POSIX locale facilities provided
by the server operating system. For additional information refer to the
documentation of your system."

In practice, most systems seem to provide collation similar to
dictionaries for English (at least) which means that symbols and spaces
aren't significant for first pass comparisons.

> Second, can anyone offer any positive or negative implications of
> choosing a non-default locale (for English language databases)? I chose
> UTF because I perceive it as more of a standard encoding approach, but
> I'll have to reinit the server with a diffeent setting if I can't
> control its predilection to ignoring non-alphanumerics.

I think it mostly depends on what you want. en_US provides you dictionary
collation, which means that "A C" > "AB" but it also means that "AC" >
"Ab".

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2004-07-26 00:27:35 Re: locale-specific sort algorithms undocumented?
Previous Message Geoff Caplan 2004-07-26 00:11:47 Re: Sql injection attacks