Re: B-Tree support function number 3 (strxfrm() optimization)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date: 2014-09-09 21:00:59
Message-ID: CA+TgmoYV7zzWNHx0GSFn0N=CoSgdGZEu=yYUYLT7_wrB9H9U0g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 4, 2014 at 5:46 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Thu, Sep 4, 2014 at 2:18 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Eh, maybe? I'm not sure why the case where we're using abbreviated
>> keys should be different than the case we're not. In either case this
>> is a straightforward trade-off: if we do a memcmp() before strcoll(),
>> we win if it returns 0 and lose if returns non-zero and strcoll also
>> returns non-zero. (If memcmp() returns non-zero but strcoll() returns
>> 0, it's a tie.) I'm not immediately sure why it should affect the
>> calculus one way or the other whether abbreviated keys are in use; the
>> question of how much faster memcmp() is than strcoll() seems like the
>> relevant consideration either way.
>
> Not quite. Consider my earlier example of sorting ~300,000 cities by
> country only. That's a pretty low cardinality attribute. We win big,
> and we are almost certain that the abbreviated key cardinality is a
> perfect proxy for the full key cardinality so we stick with
> abbreviated keys while copying over tuples. Sure, most comparisons
> will actually be resolved with a "memcmp() == 0" rather than an
> abbreviated comparison, but under my ad-hoc cost model there is no
> distinction, since they're both very much cheaper than a strcoll()
> (particularly when we factor in the NUL termination copying that a
> "memcmp() == 0" also avoids). To a lesser extent we're also justified
> in that optimism because we've already established that roughly the
> first 8 bytes of the string are bitwise equal.
>
> So the difference is that in the abbreviated key case, we are at least
> somewhat justified in our optimism. Whereas, where we're just eliding
> fmgr overhead, say on the 2nd or subsequent attribute of a multi-key
> sort, it's totally opportunistic to chance a "memcmp() == 0". The
> latter optimization can only be justified by the fact that the
> memcmp() is somewhere between dirt cheap and free. That seems like
> soemthing that should significantly impact the calculus.

Boiled down, what you're saying is that you might have a set that
contains lots of duplicates in general, but not very many where the
abbreviated-keys also match. Sure, that's true. But you might also
not have that case, so I don't see where that gets us; the same
worst-case test case Heikki developed the last time we relitigated
this point is still relevant here. In order to know how much we're
giving up in that case, we need the exact number I asked you to
provide in my previous email: the ratio of the cost of strcoll() to
the cost of memcmp().

I see that you haven't chosen to provide that information in any of
your four responses.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2014-09-09 21:09:43 Re: Spinlocks and compiler/memory barriers
Previous Message Robert Haas 2014-09-09 20:48:45 Re: pg_background (and more parallelism infrastructure patches)