Re: Collate order on Mac OS X, text with diacritics in UTF-8

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Martin Flahault <martin(at)billjobs(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Collate order on Mac OS X, text with diacritics in UTF-8
Date: 2010-01-14 02:41:35
Message-ID: 4B4E845F.80906@postnewspapers.com.au
Lists: pgsql-general

On 13/01/2010 11:15 PM, Martin Flahault wrote:

> It seems there is a problem with the collating order on BSD systems with
> diacritics using UTF8.
> If you put this text :
> a
> A
> à
> é
> e
> E
>
> in a UTF8 text file and use the "sort" command on it, you will have the
> same wrong output as with PostgreSQL :
> A
> E
> a
> e
> à
> é

First: PostgreSQL expects the OS to behave correctly and sort according
to the locale. It relies on the C library for this. If the C library
doesn't do it right, PostgreSQL won't do it right either. So you need to
get Mac OS X to do the right thing.

Your results match what I get on a Linux system without a properly
generated fr_FR.UTF-8 locale. Libc falls back on the "C" locale, which
sorts that way.
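
To see why the "C" locale gives that order, here is a minimal sketch that rebuilds the six-line sample and sorts it with the locale forced to "C". In that locale sort compares raw byte values, and in UTF-8 both à (bytes 0xC3 0xA0) and é (bytes 0xC3 0xA9) start with a byte above every ASCII letter, so they land last. The temporary file and the variable names are illustrative assumptions, not anything from the original report.

```shell
# Rebuild the sample from the report; the octal escapes are the UTF-8
# bytes of "à" (303 240) and "é" (303 251).
sample=$(mktemp)
printf 'a\nA\n\303\240\n\303\251\ne\nE\n' > "$sample"

# Force the "C" locale: comparison is by byte value, so uppercase ASCII
# comes first, then lowercase ASCII, then the multi-byte characters.
c_sorted=$(LC_ALL=C sort "$sample")
printf '%s\n' "$c_sorted"
```

The printed order should be A, E, a, e, à, é, which is the same "wrong" order as in the report.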

If I generate the fr_FR.UTF-8 locale and run the sort (on the file "x"),
I get the desired result:

LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 sort x
a
A
à
e
E
é

I don't know Mac OS X well, but this makes me wonder whether you're
just missing the data for that locale, so libc is falling back on the
"C" locale.

(Of course, being Mac OS X there are probably at least three out of date
or simply false "man" pages describing the behaviour, none of which
reflect the reality of a magic config key buried somewhere in NetInfo,
for which the documentation is also completely out of date. Bitter? Me?
Yeah, I admin a bunch of OS X machines on a business network.)

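Hmm... a quick test suggests that Mac OS X (testing on 10.4) at least
*thinks* it supports the fr_FR.UTF-8 locale. A bogus name such as "xxx"
falls back to "C" for every category, while fr_FR.UTF-8 is accepted and
listed as installed: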

osx104$ LANG=xxx LC_ALL=xxx locale
LANG="xxx"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

osx104$ LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL="fr_FR.UTF-8"

osx104$ locale -a | grep fr_FR
fr_FR
fr_FR.ISO8859-1
fr_FR.ISO8859-15
fr_FR.UTF-8

... yet it clearly doesn't:

osx104$ LANG=C LC_ALL=C sort x
A
E
a
e
à
é
osx104$ LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 sort x
A
E
a
e
à
é
osx104$ LANG=fr_FR.ISO8859-1 LC_ALL=fr_FR.ISO8859-1 sort x
A
E
a
e
à
é
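
The comparison above can be repeated for any list of locales with a small script. This is only a sketch: the helper name collation_differs_from_c is made up for illustration, and it assumes that a locale whose output matches the "C" order on this sample has missing or non-functional collation data.

```shell
# Hypothetical helper: succeeds (exit 0) only when sorting the file under
# the given locale yields a different order than under the "C" locale.
collation_differs_from_c() {
    cand_locale=$1
    cand_file=$2
    c_order=$(LC_ALL=C sort "$cand_file")
    cand_order=$(LC_ALL=$cand_locale sort "$cand_file")
    [ "$c_order" != "$cand_order" ]
}

# Same six-line sample as in the report ("à" and "é" as UTF-8 bytes).
sample=$(mktemp)
printf 'a\nA\n\303\240\n\303\251\ne\nE\n' > "$sample"

for loc in C fr_FR.UTF-8 fr_FR.ISO8859-1; do
    if collation_differs_from_c "$loc" "$sample"; then
        echo "$loc: order differs from C"
    else
        echo "$loc: same order as C (collation data missing or not used?)"
    fi
done
```

On the 10.4 machine shown above, all three locales would be expected to report the same order as "C".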

Mac OS X seems to keep its locale config in /usr/share/locale. Looking
there, there are clearly LC_COLLATE files for fr_FR.UTF-8. However,
they're identical to those for en_US.UTF-8:

osx104$ cd /usr/share/locale
osx104$ diff fr_FR.UTF-8/LC_COLLATE en_US.UTF-8/LC_COLLATE

... so your OS's localized collation support is broken or missing, at
least if more modern versions of OS X ship the same files as 10.4.
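
To check whether the same holds on another OS X release, something like the following could list every locale whose LC_COLLATE file is byte-for-byte identical to a reference locale's. It is a sketch under assumptions: the helper name same_collate_as is invented here, and the directory layout (one LC_COLLATE file per locale directory under /usr/share/locale) is taken from the 10.4 machine above and may differ elsewhere.

```shell
# Hypothetical helper: print the name of each locale directory under the
# base directory whose LC_COLLATE is identical to the reference locale's.
same_collate_as() {
    base_dir=$1
    ref_locale=$2
    for f in "$base_dir"/*/LC_COLLATE; do
        # Skip the unexpanded glob when no LC_COLLATE files exist at all.
        [ -e "$f" ] || continue
        name=$(basename "$(dirname "$f")")
        # Do not compare the reference locale with itself.
        [ "$name" = "$ref_locale" ] && continue
        # cmp -s is silent and exits 0 only when both files are identical.
        if cmp -s "$f" "$base_dir/$ref_locale/LC_COLLATE"; then
            echo "$name"
        fi
    done
}

same_collate_as /usr/share/locale en_US.UTF-8
```

If fr_FR.UTF-8 shows up in that list, its collation data is no different from the en_US.UTF-8 data, as in the diff above.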

--
Craig Ringer
