Re: How to add locale support for each column?

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: How to add locale support for each column?
Date: 2004-09-26 06:51:53
Message-ID: 87wtyh8quu.fsf@stark.xeocode.com
Lists: pgsql-hackers pgsql-patches


Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com> writes:

> I'd thought there was still a question of where such a thing would live?
> If it's an external project or a contrib thing, the above might be true,
> but if it's meant to be a truly supported internal builtin then the
> function call cost is part of the implementation and is significant data
> that cannot be thrown out.

Well, the consensus seems to be that it would be good to have complete locale
handling as envisioned by the spec. But I don't see that as relevant to this
discussion. I'm comparing a function handling strxfrm with a function handling
lower() and with sorting on a column directly. The point was to demonstrate
that it was practical (if not ideal) to switch locales repeatedly, especially
when you take into account that *any* function will have some overhead
anyways. If it were built into postgres the overhead might be lower, but I
doubt by much, and in any case it's just not an option for me now.
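To make "switching locales repeatedly" concrete, here is a minimal sketch of
the kind of per-call wrapper being discussed. It is not the actual function
from this thread: the name xfrm_in_locale is hypothetical, and it assumes a
single-threaded caller, since setlocale changes process-wide state.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from the original thread): build the strxfrm
 * sort key for src under the named collation locale, then restore the
 * previous LC_COLLATE setting. Returns a malloc'd NUL-terminated key that
 * the caller frees, or NULL if the locale is unavailable or memory runs out.
 */
char *xfrm_in_locale(const char *locale, const char *src)
{
    assert(locale != NULL && src != NULL);

    /* Passing NULL queries the current setting. Copy it, because the
     * returned buffer may be overwritten by later setlocale calls. */
    const char *cur = setlocale(LC_COLLATE, NULL);
    if (cur == NULL)
        return NULL;
    char *saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return NULL;
    strcpy(saved, cur);

    char *key = NULL;
    if (setlocale(LC_COLLATE, locale) != NULL) {
        /* With a size of 0, strxfrm only reports the length it needs. */
        size_t need = strxfrm(NULL, src, 0) + 1;
        key = malloc(need);
        if (key != NULL)
            strxfrm(key, src, need);
        /* Switch back so the rest of the process sees the old locale. */
        setlocale(LC_COLLATE, saved);
    }
    free(saved);
    return key;
}
```

Every call pays for two setlocale switches on top of the strxfrm itself,
which is exactly the overhead in question.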

> Apparently the message I responded to hung around for a while before
> getting to me because they came out of order.

That seems to be happening a lot lately.

> I agree in general, but if part of this involves forcing "C" locale (see
> my question at the end) and so any locale sorting is forced to do this,
> then if a query in en_US currently takes 7 seconds, but now will take 17,
> I think that's significant.

I compared against sorting in C locale. It would be interesting to know how
much of the penalty came from simply doing the strxfrm work itself vs the
overhead of switching locales. The former is inevitable. *Any* implementation
of locale collation orders is going to have to do it.
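Short of a real gprof run, a crude way to separate the two costs is to time
each in isolation. The following is only a sketch under stated assumptions:
the function names are hypothetical, clock() measures CPU time at coarse
resolution, and the switch loop leaves LC_COLLATE set to "C" when it returns.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical micro-benchmark: CPU seconds spent switching LC_COLLATE to
 * the given locale and back to "C", iters times. Note that it leaves the
 * collation locale set to "C" afterwards.
 */
double time_locale_switches(const char *locale, int iters)
{
    assert(locale != NULL && iters > 0);
    clock_t start = clock();
    for (int i = 0; i < iters; i++) {
        setlocale(LC_COLLATE, locale);
        setlocale(LC_COLLATE, "C");
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/*
 * Hypothetical micro-benchmark: CPU seconds spent transforming s iters
 * times under whatever LC_COLLATE locale is currently in effect.
 */
double time_strxfrm(const char *s, int iters)
{
    assert(s != NULL && iters > 0);
    char buf[256];  /* strxfrm never writes more than sizeof buf bytes */
    clock_t start = clock();
    for (int i = 0; i < iters; i++)
        strxfrm(buf, s, sizeof buf);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling setlocale(LC_COLLATE, "en_US") before time_strxfrm, and passing the
same name to time_locale_switches, would give two numbers to compare, assuming
that locale is installed. Results will vary a lot between libc
implementations.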

The latter is maybe something we can work on reducing, though not without
considerable cost in terms of code complexity. It will mean either lobbying
for API changes in libc or growing the codebase of postgres by the size of an
entire i18n package. I strongly suspect maintaining i18n packages turns out to
be a *lot* of work.

> Was your strxfrm comparison against a column comparison in "C" locale then
> rather than one using en_US or some other such locale?

C.

I could compare it against sorting in a database created in a given locale,
but I suspect I'll find gprof output more directly helpful.

> But we don't presumably have to look up the locale each time as you note.

The question is whether looking up the locale is significant compared to
executing strxfrm. I suspect it'll be significant, but not the majority of the
time.

The real question is whether speeding up sorting by removing that overhead is
worth the complexity of abandoning libc.

I would strongly urge people to consider writing postgres support to assume
standard libc functionality. If we can convince glibc and BSD libc people to
add a more reasonable interface we can optionally use it, just as we do other
more modern interfaces to old features.

If some platforms are just terminally braindead we should look for ways to
support people installing gnu libintl (or whatever the glibc i18n chunk is
called) separately and using it like we do libreadline, libkrb, or libz.

> More importantly, do we know whether or not this function really works
> properly in non-C locales? Is the strxfrm result guaranteed to sort
> correctly (using strcoll) in others?

Well you wouldn't want to use strcoll at all actually, just strcmp. Actually
Conway's reimplementation returns a bytea which is probably more correct than
my original plan to return text. Though I should check whether postgres has to
do extra work to sort bytea data instead of varchar data, especially since
strxfrm should never return strings containing nuls.
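For reference, the property being relied on is the one the C standard gives
for strxfrm: strcmp on two transformed keys has the same sign as strcoll on
the original strings, under the same LC_COLLATE locale. A minimal sketch of
that check, with hypothetical helper names:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build the strxfrm sort key for s under the current
 * LC_COLLATE locale. The key is a NUL-terminated byte string, so plain
 * strcmp can compare two keys. Caller frees the result; NULL on failure.
 */
char *make_sort_key(const char *s)
{
    assert(s != NULL);
    /* With a size of 0, strxfrm only reports the length it needs. */
    size_t need = strxfrm(NULL, s, 0) + 1;
    char *key = malloc(need);
    if (key != NULL)
        strxfrm(key, s, need);
    return key;
}

/* Reduce a comparison result to -1, 0, or 1 so two results can be compared. */
int sign_of(int v)
{
    return (v > 0) - (v < 0);
}
```

For any two strings a and b, sign_of(strcmp(make_sort_key(a),
make_sort_key(b))) should equal sign_of(strcoll(a, b)) as long as the locale
does not change between the calls. The keys are opaque bytes rather than
readable text, which is why returning bytea fits better than returning text.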

--
greg
