Re: Abbreviated keys for Numeric

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Subject: Re: Abbreviated keys for Numeric
Date: 2015-03-22 23:44:29
Message-ID: CAM3SWZRsvR758mWWdpNbohnN-qQhqSpSJAis8nhm2zmhDUgNpA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 20, 2015 at 11:58 PM, Andrew Gierth
<andrew(at)tao11(dot)riddles(dot)org(dot)uk> wrote:
> Comparisons between nulls and nulls, or between nulls and non-nulls, are
> cheap; only comparisons between non-nulls and non-nulls can be
> expensive.
>
> The purpose of abbreviation is to replace expensive comparisons by cheap
> ones where possible, and therefore the cost model used for abbreviation
> should ignore nulls entirely; all that matters is the number of non-null
> values and the probability of saving time by abbreviating them.

I don't think it's that simple. The actual point of abbreviation is to
amortize the total cost of performing O(n log n) comparisons (the
average case, which is usually assumed by infrastructure like
sortsupport), by performing an encoding process n times. That encoding
process may be more expensive than each of the comparisons. Sometimes
significantly more expensive.

Forget about abbreviation entirely for a minute - consider plain old
sorting of strxfrm() blobs, which for example the glibc documentation
recommends (with some caveats) as the preferred approach to sorting
strings while respecting collation. Suppose that your applications
happens to frequently encounter fully sorted arrays of strings when
passed an array of strings to sort. If you were using our qsort()
implementation, which, rather questionably, has a pre-check for fully
sorted input (that throws away work when the pre-check fails), then
you'll only have n comparisons. Even if an encoding is only as
expensive as an actual comparison, the potential to lose with encoding
clearly exists. Similarly, we don't abbreviate for top-n heap sorts,
even though all of those abbreviated keys may be used for at least one
comparison.

Now, I hear what you're saying about weighing the costs and the
benefits -- you point out that abbreviation of NULL values is not a
cost, which is certainly true. But if the total number of NULL values
is so high that they dominate, then abbreviation might well be the
wrong thing for that reason alone. In general, string transformation
is not recommended for sorting very small lists of strings.

Where I think your argument is stronger is around the cost of actually
aborting in particular (even though not much work is thrown away).
Scanning through the memtuples array once more certainly isn't free.
And you could take the view that it's always worth the risk, since
it's at most a marginal cost saved if the first ~10k tuples are
representative, but a large cost incurred when they are not. So on
reflection, I am inclined to put the check for non-null values back
in. I also concede that the cost of abbreviation is lower with your
numeric encoding scheme than it is for that of the text opclass, which
favors this approach.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kouhei Kaigai 2015-03-23 00:12:19 Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API)
Previous Message Mark Kirkwood 2015-03-22 23:06:16 Re: Remove fsync ON/OFF as a visible option?