Re: B-Tree support function number 3 (strxfrm() optimization)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date: 2014-09-04 21:46:18
Message-ID: CAM3SWZRNUr4Z5hijt=Xvdma-5uv2gg3+922drOZLwSoHNh5MkA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 4, 2014 at 2:18 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Eh, maybe? I'm not sure why the case where we're using abbreviated
> keys should be different than the case we're not. In either case this
> is a straightforward trade-off: if we do a memcmp() before strcoll(),
> we win if it returns 0 and lose if returns non-zero and strcoll also
> returns non-zero. (If memcmp() returns non-zero but strcoll() returns
> 0, it's a tie.) I'm not immediately sure why it should affect the
> calculus one way or the other whether abbreviated keys are in use; the
> question of how much faster memcmp() is than strcoll() seems like the
> relevant consideration either way.

Not quite. Consider my earlier example of sorting ~300,000 cities by
country only. That's a pretty low cardinality attribute. We win big,
and we are almost certain that the abbreviated key cardinality is a
perfect proxy for the full key cardinality so we stick with
abbreviated keys while copying over tuples. Sure, most comparisons
will actually be resolved with a "memcmp() == 0" rather than an
abbreviated comparison, but under my ad-hoc cost model there is no
distinction, since they're both very much cheaper than a strcoll()
(particularly when we factor in the NUL termination copying that a
"memcmp() == 0" also avoids). To a lesser extent we're also justified
in that optimism because we've already established that roughly the
first 8 bytes of the string are bitwise equal.

So the difference is that in the abbreviated key case, we are at least
somewhat justified in our optimism. Whereas, where we're just eliding
fmgr overhead, say on the 2nd or subsequent attribute of a multi-key
sort, it's totally opportunistic to chance a "memcmp() == 0". The
latter optimization can only be justified by the fact that the
memcmp() is somewhere between dirt cheap and free. That seems like
soemthing that should significantly impact the calculus.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2014-09-04 22:23:54 Re: Pg_upgrade and toast tables bug discovered
Previous Message Robert Haas 2014-09-04 21:29:36 Re: PL/pgSQL 2