Re: improving GROUP BY estimation

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Mark Dilger <hornschnorter(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: improving GROUP BY estimation
Date: 2016-03-13 22:09:44
Message-ID: 1457906984.27231.16.camel@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:
> On 4 March 2016 at 13:10, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> wrote:
> >
> > The risk associated with over-estimation is that we may switch from
> > HashAggregate to GroupAggregate, and generally select paths better
> > suited for large number of rows. Not great, because the plan may
> > not be
> > optimal and thus run much slower - but that usually only happens on
> > small tables, because on large tables we would probably end up
> > using
> > those same paths anyway.
> >
> > With under-estimation, the main risks are much graver, as we may
> > end up
> > using HashAggregate only to get killed by OOM. When we're lucky not
> > to
> hit that, in my experience this often leads to a cascade of NestedLoop
> > nodes processing orders of magnitude more tuples than expected.
> > Awful.
> >
> > So I think the risk is asymmetric, and that's why I like the new
> > estimator more, because it tends to over-estimate, but in a very
> > reasonable way.
> >
> Agreed. Under-estimating is worse than over-estimating.
>
>
> -           reldistinct *= rel->rows / rel->tuples;
> +           reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
>
> Looking at this, I agree that this new formula from [1] is generally
> better than the old one in most (but not all) cases.
>
> One problem with it is that it no longer takes into account
> rel->tuples, which means that when returning all the tuples from the
> table, it no longer gives an estimate equal to the total number of
> distinct values in the table. In fact it tends to underestimate when
> returning nearly all the rows from the table.

Yes, that's a good point.

>
> The old formula is clearly not a good general-purpose estimate.
> However, there is one case where it does return an accurate estimate
> -- when all the values in the underlying table are distinct. In this
> case the new formula consistently produces underestimates, while the
> old formula works well. For example:
>
> CREATE TABLE t AS SELECT (100000000 * random())::int a,
>                          i::int b
>                     FROM generate_series(1,1000000) s(i);
> ANALYSE t;
>
> EXPLAIN ANALYSE
> SELECT a FROM t GROUP BY a;
>
> So there are clearly around 1 million distinct values for "a", but
> the new formula gives an estimate of around 630,000. That's not a
> massive underestimate, but it's sufficiently obvious and visible that
> it would be good to do better if we can.
>
>
> I think that a better formula to use would be
>
> reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));
>
> This comes from [2], which gives a general formula for selection
> without replacement, and then gives the above as an approximation to
> the uniform distribution case. It's not really made clear in that
> paper, but that approximation is actually the "with replacement"
> approximation, but starting from different initial assumptions to
> give a formula that has better boundary conditions and special-case
> behaviour, as described below.
>
> ...
>
> For most values of N and n, the approximation from [2] is almost
> indistinguishable from the exact formula, whereas the formula from
> [1] tends to underestimate when selecting a large number of distinct
> values (e.g., try setting n=900000 in the plot above).

Yes, I agree that the formula you propose behaves better.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
