Re: improve Chinese locale performance

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Quan Zongliang <quanzongliang(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: improve Chinese locale performance
Date: 2013-07-23 12:32:08
Message-ID: CA+TgmoYmR8YkvWtCAwf3YUBBso=GFByT-Out9DKMTjWMRpXBdg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 22, 2013 at 12:49 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Mon, Jul 22, 2013 at 12:50 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>> I think part of the problem is that we call strcoll for each comparison,
>> instead of doing strxfrm once for each datum and then just strcmp for
>> each comparison. That is effectively equivalent to what the proposal
>> implements.
>
> Fwiw I used to be a big proponent of using strxfrm. But upon further
> analysis I realized it was a really difficult tradeoff. strxfrm saves
> potentially a lot of CPU cost but at the expense of expanding the size
> of the sort key. If the sort spills to disk or even if it's just
> memory bandwidth limited it might actually be slower than doing the
> additional CPU work of calling strcoll.
>
> It's hard to see how to decide in advance which way will be faster. I
> suspect strxfrm is still the better bet, especially for complex large
> character set based locales like Chinese. strcoll might still win by a
> large margin on simple mostly-ascii character sets.

The storage blow-up on systems I've tested is on the order of 10x.
That's possibly fine if the data still fits in memory, but it pretty
much sucks if it makes your sort spill to disk, which seems like a
likely outcome in many cases.
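
For anyone who wants to check the blow-up on their own system, a rough
sketch follows. It compares the length of the strxfrm() output with the
length of the input. The names make_sort_key and key_blowup are made up
for illustration and are not PostgreSQL code, and the number you get
depends entirely on the LC_COLLATE locale in effect and on the libc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a malloc'd strxfrm() sort key for s under
 * the current LC_COLLATE locale and store its length (excluding the
 * terminating NUL) in *keylen. Returns NULL if allocation fails.
 * The caller frees the result.
 */
char *
make_sort_key(const char *s, size_t *keylen)
{
    size_t need;
    char  *key;

    assert(s != NULL && keylen != NULL);

    /* With a size of zero, strxfrm() only reports the length required. */
    need = strxfrm(NULL, s, 0);

    key = malloc(need + 1);
    if (key == NULL)
        return NULL;

    strxfrm(key, s, need + 1);
    *keylen = need;
    return key;
}

/*
 * Hypothetical helper: ratio of sort-key size to input size for one
 * string. Returns 0.0 for an empty input or on allocation failure.
 */
double
key_blowup(const char *s)
{
    size_t len = strlen(s);
    size_t keylen = 0;
    char  *key = make_sort_key(s, &keylen);

    if (key == NULL || len == 0)
    {
        free(key);
        return 0.0;
    }
    free(key);
    return (double) keylen / (double) len;
}
```

To try it, call setlocale(LC_COLLATE, ...) with the locale of interest
first and then feed key_blowup() representative strings. Once the keys
exist, comparing two of them with strcmp() should give the same ordering
as strcoll() on the original strings, which is the trade being discussed:
less CPU per comparison in exchange for larger sort keys.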

But I don't have much trouble believing the OP's contention that he's
coded a locale-specific version that is faster than the version that
ships with the OS. On glibc, for example, we copy the strings we want
to compare, so that we can add a terminating zero byte. The first
thing that glibc does is call strlen(). That's pretty horrible, and
I'm not at all sure the horror ends there, either.
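
To make the copying overhead concrete, here is a minimal sketch of what
comparing two length-counted strings through strcoll() involves. It is
not the actual PostgreSQL code; compare_counted is a name invented for
this example, and it assumes the inputs contain no embedded zero bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: compare two byte strings that carry explicit
 * lengths and are not NUL-terminated. strcoll() accepts only
 * NUL-terminated strings, so each input is copied into a scratch buffer
 * and terminated first. Returns <0, 0, or >0 like strcoll(). If
 * allocation fails it falls back to plain byte order.
 */
int
compare_counted(const char *a, size_t alen, const char *b, size_t blen)
{
    char *abuf;
    char *bbuf;
    int   result;

    assert(a != NULL && b != NULL);

    abuf = malloc(alen + 1);
    bbuf = malloc(blen + 1);
    if (abuf == NULL || bbuf == NULL)
    {
        size_t n = (alen < blen) ? alen : blen;

        free(abuf);
        free(bbuf);
        result = memcmp(a, b, n);
        if (result == 0)
            result = (alen > blen) - (alen < blen);
        return result;
    }

    /* One full pass over each input just to add the terminator... */
    memcpy(abuf, a, alen);
    abuf[alen] = '\0';
    memcpy(bbuf, b, blen);
    bbuf[blen] = '\0';

    /* ...and the library may then scan the copies again to find their ends. */
    result = strcoll(abuf, bbuf);

    free(abuf);
    free(bbuf);
    return result;
}
```

Every comparison pays for two copies before any collation work starts,
even though the caller already knew both lengths. A comparison routine
that took the lengths directly could skip all of that.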

It would be great to have support for user-defined collations in
PostgreSQL. Let the user provide their own comparison function and
whatever else is needed and use that instead of the OS-specific
support. Aside from the performance advantages, one could even create
collations that have the same names and orderings on all platforms we
support. Our support team has gotten more than one inquiry of the
form "what's the equivalent of Linux collation XYZ on Windows?" - and
telling them that there is no exact equivalent is not the answer they
want to hear.
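
As a purely hypothetical illustration of the idea, a user-defined
collation could be little more than a name plus a comparison function
that takes length-counted strings. Nothing below is an existing
PostgreSQL interface; the type user_collation, the function
find_collation, and the two example collations are all invented for this
sketch. The point is only that the ordering is defined by code we ship,
so it is the same on every platform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature: compare two length-counted byte strings. */
typedef int (*collation_cmp_fn) (const char *a, size_t alen,
                                 const char *b, size_t blen);

/* Hypothetical descriptor for one user-defined collation. */
typedef struct user_collation
{
    const char      *name;
    collation_cmp_fn compare;
} user_collation;

/* Plain byte order, with the shorter string first on a shared prefix. */
static int
bytewise_compare(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t n = (alen < blen) ? alen : blen;
    int    c = memcmp(a, b, n);

    if (c == 0)
        c = (alen > blen) - (alen < blen);
    return c;
}

/* Fold ASCII upper case to lower case without consulting the locale. */
static unsigned char
ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char) (c + ('a' - 'A')) : c;
}

/* ASCII case-insensitive order; byte order breaks ties for determinism. */
static int
ascii_ci_compare(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t n = (alen < blen) ? alen : blen;
    size_t i;

    for (i = 0; i < n; i++)
    {
        unsigned char ca = ascii_fold((unsigned char) a[i]);
        unsigned char cb = ascii_fold((unsigned char) b[i]);

        if (ca != cb)
            return (ca > cb) - (ca < cb);
    }
    if (alen != blen)
        return (alen > blen) - (alen < blen);
    return bytewise_compare(a, alen, b, blen);
}

/* Hypothetical registry of collations that behave the same everywhere. */
static const user_collation example_collations[] = {
    {"bytewise", bytewise_compare},
    {"ascii_ci", ascii_ci_compare},
};

/* Look up a collation by name; returns NULL if it is not registered. */
const user_collation *
find_collation(const char *name)
{
    size_t i;
    size_t count = sizeof(example_collations) / sizeof(example_collations[0]);

    assert(name != NULL);
    for (i = 0; i < count; i++)
    {
        if (strcmp(example_collations[i].name, name) == 0)
            return &example_collations[i];
    }
    return NULL;
}
```

A real design would need more than a comparator (hashing and sort-key
support at least, the "whatever else is needed" above), but even this
much shows how the same collation name could mean the same ordering on
Linux and Windows.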

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
