Re: Locale agnostic unicode text

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale agnostic unicode text
Date: 2005-01-24 18:00:50
Message-ID: 87y8ei4sh9.fsf@stark.xeocode.com
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Greg Stark <gsstark(at)mit(dot)edu> writes:
> >
> > So it's slow but not spectacularly awful.
>
> glibc is not the world.

Sorry, I should have said "It's not *necessarily* spectacularly awful."

> I tried Dawid's functions on Mac OS X, being a
> random non-glibc platform that I happen to use. Using some text data
> I had handy (44500 lines, 1.9MB) I made a single-column text table and
> timed
>     explain analyze select * from foo order by f1;
> The results were
>     In C locale, SQL_ASCII encoding: 820 ms
>     In C locale, UNICODE encoding: 825 ms
>     Using Dawid's functions: 62010 ms
>     Stripped-down functions: 21010 ms

I don't think these are fair comparisons though. The C locale probably
short-circuits much of the work that strxfrm/strcoll have to do for other
locales. I think the fair comparison is between a database initdb'd in a
non-C locale like en_US, using strcoll with no setlocale calls, and one
calling setlocale twice for every record.
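
As a rough stand-in for that comparison outside Postgres, here is a minimal
Python sketch. It is not Dawid's code and says nothing about the backend's
real costs; the function names and the list of candidate locales are my own
assumptions, and it falls back to the C locale if en_US is not installed.

```python
import functools
import locale
import timeit


def pick_locale(candidates=("en_US.UTF-8", "en_US.utf8", "C")):
    """Return the first candidate locale accepted for LC_COLLATE (assumed names)."""
    for name in candidates:
        try:
            locale.setlocale(locale.LC_COLLATE, name)
            return name
        except locale.Error:
            continue
    raise RuntimeError("no usable locale found")


def sort_set_once(rows, name):
    """Set the collation locale once, then sort with strcoll."""
    locale.setlocale(locale.LC_COLLATE, name)
    return sorted(rows, key=functools.cmp_to_key(locale.strcoll))


def sort_set_per_compare(rows, name):
    """Call setlocale twice around every comparison (switch in, switch back)."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting

    def compare(a, b):
        locale.setlocale(locale.LC_COLLATE, name)
        try:
            return locale.strcoll(a, b)
        finally:
            locale.setlocale(locale.LC_COLLATE, previous)

    return sorted(rows, key=functools.cmp_to_key(compare))


if __name__ == "__main__":
    name = pick_locale()
    rows = [f"line {i * 7919 % 10007}" for i in range(5000)]
    once = timeit.timeit(lambda: sort_set_once(rows, name), number=3)
    each = timeit.timeit(lambda: sort_set_per_compare(rows, name), number=3)
    print(f"locale={name} set once: {once:.3f}s, set per compare: {each:.3f}s")
```

Both functions pay the same strcoll cost, so any gap between the two timings
is the setlocale overhead on that platform rather than the cost of collation
itself.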

In any case it's true, some platforms have bad implementations of things.

But if you have to do this (and I have to do this too) it doesn't really
matter that some platforms don't handle it well. This just means those
platforms aren't feasible and I'm forced to use glibc-based platforms. It
doesn't mean I should dismiss Postgres for the project.

Incidentally, Dawid, if you are on a platform like OSX with a performance
problem with this, there is a possible optimization you can use. If you store
and update the data rarely but sort it frequently, you can store the output of
strxfrm in a bytea column. Then you can sort on that column without having to
call setlocale repeatedly.
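
A minimal sketch of the idea in Python rather than in the database: the
strxfrm keys are computed once at "store" time and the sort afterwards touches
only the stored keys. The function names are hypothetical, and a dict stands
in for the extra bytea column. Python's locale.strxfrm returns a str rather
than raw bytes, but the property relied on is the same one strxfrm gives in
C: comparing the transformed values orders them as strcoll would order the
originals.

```python
import locale


def build_sort_keys(rows, name):
    """Compute strxfrm keys once; this is the 'store' step (hypothetical helper)."""
    locale.setlocale(locale.LC_COLLATE, name)
    return {row: locale.strxfrm(row) for row in rows}


def sort_by_stored_keys(rows, keys):
    """Sort using only the stored keys; no locale calls happen here."""
    return sorted(rows, key=keys.__getitem__)


if __name__ == "__main__":
    rows = ["pear", "Apple", "apple", "fig"]
    # "C" is used so the sketch runs anywhere; substitute an installed
    # locale name to see locale-specific ordering.
    keys = build_sort_keys(rows, "C")
    print(sort_by_stored_keys(rows, keys))
```

The keys are only valid for the locale they were built under, so they have to
be recomputed whenever the text changes or a different collation is wanted.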

If you have a few queries that can be optimized to always use indexes, you can
even store this information in a functional index instead of denormalizing the
table.

--
greg
