Re: Abbreviated keys for Numeric

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Subject: Re: Abbreviated keys for Numeric
Date: 2015-03-23 01:48:22
Message-ID: 87h9tcbkzn.fsf@news-spur.riddles.org.uk
Lists: pgsql-hackers

>>>>> "Peter" == Peter Geoghegan <pg(at)heroku(dot)com> writes:

Peter> I don't think it's that simple. The actual point of abbreviation
Peter> is to amortize the total cost of performing O(n log n)
Peter> comparisons (the average case, which is usually assumed by
Peter> infrastructure like sortsupport), by performing an encoding
Peter> process n times. That encoding process may be more expensive
Peter> than each of the comparisons. Sometimes significantly more
Peter> expensive.

But the cost of the encoding depends only on the number of non-null
values. Again, the nulls turn out to be irrelevant.

Peter> Forget about abbreviation entirely for a minute - consider plain
Peter> old sorting of strxfrm() blobs, which for example the glibc
Peter> documentation recommends (with some caveats) as the preferred
Peter> approach to sorting strings while respecting collation. Suppose
Peter> that your application happens to frequently encounter fully
Peter> sorted arrays of strings when passed an array of strings to
Peter> sort. If you were using our qsort() implementation, which,
Peter> rather questionably, has a pre-check for fully sorted input
Peter> (that throws away work when the pre-check fails), then you'll
Peter> only have n comparisons. Even if an encoding is only as
Peter> expensive as an actual comparison, the potential to lose with
Peter> encoding clearly exists. Similarly, we don't abbreviate for
Peter> top-n heap sorts, even though all of those abbreviated keys may
Peter> be used for at least one comparison.

Nothing in the above paragraph has the slightest relevance to even the
text abbreviation code, much less the numeric one.

Peter> Now, I hear what you're saying about weighing the costs and the
Peter> benefits -- you point out that abbreviation of NULL values is
Peter> not a cost, which is certainly true.

It's not a cost because it DOESN'T HAPPEN. The abbreviation function is
not called for null inputs.
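
To spell out that control flow, here is a minimal standalone sketch (my
own illustrative types and names, not the actual tuplesort.c code,
though the abbrev_converter hook in SortSupport is real): the converter,
and hence the encoding cost and the cost model's input count, only ever
see non-null datums.

/*
 * Standalone sketch, not PostgreSQL source: models how the sort code
 * applies the abbreviation converter only to non-null datums.
 */
#include <stdio.h>

typedef struct
{
	int		value;			/* stand-in for a Datum */
	int		isnull;
	int		abbrev;			/* stand-in for the abbreviated key */
} SortTuple;

static int	input_count = 0;	/* non-null values seen by the converter */

static int
abbrev_converter(int value)
{
	input_count++;				/* nulls never increment this */
	return value / 100;			/* toy encoding */
}

int
main(void)
{
	SortTuple	tuples[] = {
		{150, 0, 0}, {0, 1, 0}, {42, 0, 0}, {0, 1, 0},
		{0, 1, 0}, {999, 0, 0}, {0, 1, 0}, {7, 0, 0}
	};
	int			ntuples = (int) (sizeof(tuples) / sizeof(tuples[0]));

	for (int i = 0; i < ntuples; i++)
	{
		if (tuples[i].isnull)
			continue;			/* NULLs bypass abbreviation entirely */
		tuples[i].abbrev = abbrev_converter(tuples[i].value);
	}

	/* prints "4 of 8", however many nulls are mixed in */
	printf("values abbreviated: %d of %d\n", input_count, ntuples);
	return 0;
}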

Peter> But if the total number of NULL values is so high that they
Peter> dominate, then abbreviation might well be the wrong thing for
Peter> that reason alone.

No. This is the point that you're getting wrong. The amount of time
saved by the abbreviation code depends ONLY on the number and
cardinality of the NON-NULL values. Also, the time _lost_ by the
abbreviation code if we fail to abort in pathological cases STILL
depends only on the number of non-null values, so there's no benefit in
aborting even in a case when we have 1000 equal non-null values mixed in
with tens of millions of nulls.

Let me give you an actual example:

create table numtest2 as
select floor(random() * 200)::numeric as a
from generate_series(1,1000000);
create table numtest3 as
select a from (select floor(random() * 200)::numeric as a
from generate_series(1,1000000)
union all
select null from generate_series(1,4000000)) s
order by random();

So one table has a million non-null rows with 200 distinct values, and
the other has the same plus 4 million null rows.

Using my code, which ignores nulls in the cost model:

select * from numtest2 order by a offset 100000000; -- 1.5 seconds
select * from numtest3 order by a offset 100000000; -- 3.6 seconds

With abbreviation disabled entirely:

select * from numtest2 order by a offset 100000000; -- 4.5 seconds
select * from numtest3 order by a offset 100000000; -- 6.8 seconds

Notice that while the proportional speedup is only <2x rather than 3x
for the version with nulls, the absolute time saved is the same:
abbreviation bought us ~3 seconds in both cases.

(This relation remains true for larger values too: make it 19 million
nulls rather than 4 million and the timings are 11.5 sec / 14.5 sec,
still a 3 sec difference.)

Your version would have aborted abbreviation on that second query, thus
losing a 3 second speedup. How on earth is that supposed to be
justified? It's not like there's any realistically possible case where
your version performs faster than mine by more than a tiny margin.

Peter> Where I think your argument is stronger is around the cost of
Peter> actually aborting in particular (even though not much work is
Peter> thrown away). Scanning through the memtuples array once more
Peter> certainly isn't free.

Yes, the cost of actually aborting abbreviation goes up with the number
of nulls. But your version of the code makes that WORSE, by making it
more likely that we will abort unnecessarily.

If we use the raw value of memtuplecount for anything, it should be to
make it LESS likely that we abort abbreviation (since we'd be paying a
higher cost), not more. But benchmarking doesn't suggest that this is
necessary, at least not for numerics.
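
To make the contrast concrete, a hedged standalone sketch (the
~1-per-2000 cardinality threshold mirrors the shape of the numeric cost
model's test, but the names and numbers here are illustrative): with
low-cardinality non-null data diluted by nulls, a test keyed to the raw
tuple count flips to abort while the non-null-based test keeps
abbreviating.

/*
 * Standalone sketch, not PostgreSQL source: contrasts basing the abort
 * test on the raw tuple count versus the non-null input count.
 */
#include <stdio.h>
#include <stdbool.h>

/* abort if the abbreviated keys look insufficiently distinct */
static bool
abbrev_abort(double abbr_card, int count_basis)
{
	return abbr_card < count_basis / 2000.0 + 0.5;
}

int
main(void)
{
	int			memtuplecount = 50000;	/* all tuples so far, nulls included */
	int			input_count = 10000;	/* the non-null tuples among them */
	double		abbr_card = 20.0;		/* estimated distinct abbreviated keys */

	/* non-null basis: 20 >= 5.5, keep abbreviating */
	printf("non-null basis:      abort = %d\n",
		   (int) abbrev_abort(abbr_card, input_count));

	/* raw-count basis: 20 < 25.5, abort -- yet the work saved (and the
	 * work wasted by actually aborting) depends only on the non-nulls */
	printf("memtuplecount basis: abort = %d\n",
		   (int) abbrev_abort(abbr_card, memtuplecount));

	return 0;
}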

Peter> And you could take the view that it's always worth the risk,
Peter> since it's at most a marginal cost saved if the first ~10k
Peter> tuples are representative, but a large cost incurred when they
Peter> are not. So on reflection, I am inclined to put the check for
Peter> non-null values back in.

Right result but the wrong reasoning.

Peter> I also concede that the cost of abbreviation is lower with your
Peter> numeric encoding scheme than it is for that of the text opclass,
Peter> which favors this approach.

If you do some benchmarks on text columns, you'll find that this is a
win there too.

--
Andrew (irc:RhodiumToad)
