Re: How to add locale support for each column?

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: How to add locale support for each column?
Date: 2004-09-26 02:42:09
Message-ID: 873c15agzi.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-patches

Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com> writes:

> But shouldn't the comparison be against sorting on col not lower(col)?
> strxfrm(col) sorts seem comparable to col, strxfrm(lower(col)) sorts seem
> comparable to lower(col). Some collations do treat 'A' and 'a' as be
> adjacent in sort order, but that's not a guarantee, so it's not valid to
> say, "everywhere you'd use lower(col) you can use strxfrm instead."

Well, in my implementation strxfrm is a postgresql function. So I wanted to
compare it with an expression that had at least as much overhead as a
postgresql expression with a single function call.

> And in past numbers you sent, it looked like the amounts were: 1s for sort
> on col, 1.5s for sort on lower(col), 2.5s for sort on strxfrm(col). That
> doesn't seem negligible to me

Right, I amended my "negligible" claim. It's a significant but reasonable
speed. A 1.5s delay on sorting 100k rows is certainly not the kind of
intolerable delay that would make the idea of switching locales intolerable.

> unless that doesn't grow linearly with the number of rows.

Well I was comparing sorting 206,000 rows. Even if it scales linearly, a 10s
delay on sorting 2M records isn't really fatal. I certainly wouldn't want to
remove the ability to sort using strcmp if the data is ascii or binary. But if
you're going to use locale collation order it's going to be slower. strxfrm
has to do quite a bit of work. Even a postgres-internal mechanism is going to
have to do that same work.

The only time you could save is the time it takes to look up "en_US" in a list
(or hash) of cached locales and switch a pointer. I suspect that's going to be
on a small (but not negligible) portion the overhead. I guess this is subject
to analysis, I'll try to do a gprof run at some point to answer that.

> > I see no reason to think Postgres's implementation of looking up xfrm rules
> > for the specified locale will be any faster than the OS's. We know some OS's
> > suck but some certainly don't.
>
> But do you have to change locales per row or per sort? Presumably, a built
> in implementation may be able to do the latter rather than the former.

We certainly need the ability to change the locales per-row, in fact possibly
multiple times per row.

Consider

select en,fr
from translations
order by en,fr

Which is actually something reasonable I could have to do in my current
project.

However changing locales should be nigh-instantaneous, it really ought to be
just changing a pointer. And in the API Tom foresees shouldn't even happen.
The only cost of sorting on many locales (aside from the initial load) would
be in the reduced cache hit rate from using more locale tables.

--
greg

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ross J. Reedstrom 2004-09-26 03:03:34 Re: 'TID index'
Previous Message Bruce Momjian 2004-09-26 02:25:47 Re: Make configure use krb5-config

Browse pgsql-patches by date

  From Date Subject
Next Message Stephan Szabo 2004-09-26 03:11:41 Re: How to add locale support for each column?
Previous Message Bruce Momjian 2004-09-26 02:14:50 Re: libpq verinfo patch