Re: improving GROUP BY estimation

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Mark Dilger <hornschnorter(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: improving GROUP BY estimation
Date: 2016-03-13 15:24:17
Message-ID: CAEZATCU3hhRjGbvCydP-i3ik4FGX9OcXsK2ZG5TGuVnJOvb=kg@mail.gmail.com
Lists: pgsql-hackers

On 4 March 2016 at 13:10, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> The risk associated with over-estimation is that we may switch from
> HashAggregate to GroupAggregate, and generally select paths better
> suited for large number of rows. Not great, because the plan may not be
> optimal and thus run much slower - but that usually only happens on
> small tables, because on large tables we would probably end up using
> those same paths anyway.
>
> With under-estimation, the main risks are much graver, as we may end up
> using HashAggregate only to get killed by OOM. When we're lucky not to
> hit that, in my experience this often leads to a cascade of NestedLoop
> nodes processing orders of magnitude more tuples than expected. Awful.
>
> So I think the risk is asymmetric, and that's why I like the new
> estimator more, because it tends to over-estimate, but in a very
> reasonable way.
>

Agreed. Under-estimating is worse than over-estimating.

- reldistinct *= rel->rows / rel->tuples;
+ reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

Looking at this, I agree that this new formula from [1] is generally
better than the old one in most (but not all) cases.

One problem with it is that it no longer takes into account
rel->tuples, which means that when returning all the tuples from the
table, it no longer gives an estimate equal to the total number of
distinct values in the table. In fact it tends to underestimate when
returning nearly all the rows from the table.

The old formula is clearly not a good general-purpose estimate.
However, there is one case where it does return an accurate estimate
-- when all the values in the underlying table are distinct. In this
case the new formula consistently produces underestimates, while the
old formula works well. For example:

CREATE TABLE t AS
  SELECT (100000000 * random())::int a,
         i::int b
  FROM generate_series(1,1000000) s(i);
ANALYSE t;

EXPLAIN ANALYSE
SELECT a FROM t GROUP BY a;

So there are clearly around 1 million distinct values for "a", but the
new formula gives an estimate of around 630,000. That's not a massive
underestimate, but it's sufficiently obvious and visible that it would
be good to do better if we can.
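
Incidentally, that 630,000 figure is just the 1 - 1/e factor showing
up: with reldistinct and rel->rows both around 1,000,000, the new
formula evaluates to roughly

1000000 * (1 - (1 - 1/1000000)^1000000) ~= 1000000 * (1 - 1/e) ~= 632000

so in the all-distinct case, no matter how large the table is,
selecting every row is estimated at only about 63% of the true number
of distinct values.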

I think that a better formula to use would be

reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));

This comes from [2], which gives a general formula for selection
without replacement, and then gives the above as an approximation for
the uniform distribution case. It's not really made clear in that
paper, but that approximation is itself a "with replacement"
approximation; it just starts from different initial assumptions, which
give it better boundary conditions and special-case behaviour, as
described below.
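
To make the comparison concrete before getting into the analysis, here
is a small standalone C sketch (not the planner code itself; the
function names est1 and est2 are just for illustration) that evaluates
the patch's formula from [1] and the proposed formula from [2] for the
all-distinct example above:

#include <math.h>
#include <stdio.h>

/*
 * est1: D(p) = n * (1 - (1 - 1/n)^p), the formula from [1].
 * est2: D(p) = n * (1 - (1 - p/N)^(N/n)), the formula from [2].
 *
 * ndistinct is the number of distinct values in the table (n), tuples
 * is the table size (N) and rows is the number of rows selected (p).
 */
static double
est1(double ndistinct, double rows)
{
    return ndistinct * (1 - powl((ndistinct - 1) / ndistinct, rows));
}

static double
est2(double ndistinct, double rows, double tuples)
{
    return ndistinct * (1 - powl(1 - rows / tuples, tuples / ndistinct));
}

int
main(void)
{
    double      N = 1000000;    /* rows in the table */
    double      n = 1000000;    /* distinct values: all rows distinct */

    for (double p = 250000; p <= N; p += 250000)
        printf("p = %7.0f: est1 = %7.0f, est2 = %7.0f\n",
               p, est1(n, p), est2(n, p, N));

    return 0;
}

Compiled with -lm, this prints roughly 632,000 for est1 and exactly
1,000,000 for est2 at p = N, matching the ~630,000 estimate seen above
and the desired answer respectively.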

For the record, here's a quick analysis of how the 2 formulae come about:

Assume there are:
N rows in the table
n distinct values (n <= N)
p rows are chosen at random (p <= N)

so the problem is to estimate how many distinct values there will be
in the p rows chosen. Both approaches work by first estimating the
probability that some particular value x is *not* chosen.

[1] starts from the assumption that each row of the table has a
probability of 1/n of being equal to x. So the probability that x is
not the first value chosen is

P(not x, 1) = 1 - 1/n

Then, if the value chosen first is replaced, the probability that x is
not the second value chosen is the same. The "with replacement"
approximation makes each choice an independent probability, so the
overall probability that x is not in any of the p rows chosen is
simply the product of the p individual probabilities:

P(not x, p) = (1 - 1/n)^p

Then of course the probability that x *is* chosen at least once in the p rows is

P(x, p) = 1 - (1 - 1/n)^p

Finally, since there are n possible values of x, and they're all
equally likely in the table, the expected number of distinct values is

D(p) = n * (1 - (1 - 1/n)^p)

The flaw in this approach is that for large p the "with replacement"
approximation gets worse and worse, and the probabilities P(x, p)
systematically under-estimate the actual probability that x is chosen.
Without replacement, that probability increases as more values not
equal to x are chosen, and by the time the last row is chosen,
P(x, p=N) should actually be 1.

[2] starts from a different initial assumption (uniform distribution
case) -- for any given value x, assume that the table contains N/n
instances of x (ignoring the fact that that might not be an integer).
Note that with this assumption there is guaranteed to be at least one
instance of x, which is not the case with the above approach.

Now consider the first instance of x in the table. If we choose p rows
from the table, the probability that that first instance of x is not
chosen is

P(not x[1], p) = 1 - p / N

Making the same "with replacement" simplification, the probability
that the second instance of x is not chosen is the same, and they're
independent probabilities again. So, if there are N/n instances of x,
the overall probability that none of them is chosen is

P(not x, p) = (1 - p / N)^(N/n)

So then the formula for the expected number of distinct values becomes

D(p) = n * (1 - (1 - p / N)^(N/n))

This alternate formula has a few nice properties:

1). D(p=0) = 0

2). D(p=N) = n (choosing all the rows results in all the distinct
values being chosen)

3). When n=N, D(p) = p (when all the values in the table are distinct,
so are all the values chosen). This case matches the current formula.

The first formula for D(p) lacks properties (2) and (3).
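
(Checking those: putting p = N into the second formula gives
D(N) = n * (1 - 0^(N/n)) = n, and putting n = N gives
D(p) = N * (1 - (1 - p/N)^1) = p. Putting p = N into the first formula
instead gives n * (1 - (1 - 1/n)^N), which is strictly less than n
whenever n > 1, about 0.632 * n in the n = N case, as in the example
above.)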

The reason this second formula generally works better is that the
"with replacement" approximation is only being made for up to (N/n)
selections, rather than up to N selections. So it remains a good
approximation as long as (N/n) is small compared to N, i.e., as long
as n is fairly large. In fact even for n=1 or 2, it still works
adequately if the results are rounded to the nearest integer.
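
(For n=1, for example, the formula gives D(p) = 1 - (1 - p/N)^N, which
is at least 1 - 1/e ~= 0.63 for any p >= 1, and so rounds to the
correct answer of 1.)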

The following snippet can be used in gnuplot to visualise the results
for various values of n and N. I've included the exact "without
replacement" formula too, by rewriting it in terms of the gamma
function:

N=1000000.0; n=500000.0
plot [p=0:N] [0:n] p*n/N, \
     n*(1-(1-1/n)**p), \
     n*(1-(1-p/N)**(N/n)), \
     n*(1 - exp(lgamma(N-N/n+1) + lgamma(N-p+1) - lgamma(N+1) - lgamma(N-N/n-p+1)))

For most values of N and n, the approximation from [2] is almost
indistinguishable from the exact formula, whereas the formula from [1]
tends to underestimate when selecting a large number of distinct
values (e.g., try setting n=900000 in the plot above).
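
(With n=900000 and p=N, for example, the formula from [1] gives
900000 * (1 - (1 - 1/900000)^1000000) ~= 604,000, whereas both the
formula from [2] and the exact formula give 900,000.)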

Regards,
Dean

[1] http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers
[2] http://www.adellera.it/investigations/distinct_balls/distinct_balls.pdf
