Re: improving GROUP BY estimation

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Mark Dilger <hornschnorter(at)gmail(dot)com>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: improving GROUP BY estimation
Date: 2016-03-04 13:10:36
Message-ID: 1457097036.24980.28.camel@2ndquadrant.com
Lists: pgsql-hackers

On Thu, 2016-03-03 at 11:42 -0800, Mark Dilger wrote:
> > On Mar 3, 2016, at 11:27 AM, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> >
> > On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > So yes, each estimator works great for exactly the opposite cases. But notice that the result of the new formula is typically much higher than the old one, sometimes by two orders of magnitude (and it shouldn't be difficult to construct examples with much larger differences).
> >
> > The table also includes the 'average' estimator you propose, but it's rather obvious that the result is always much closer to the new value, simply because
> >
> > (small number) + (huge number)
> > ------------------------------
> > 2
> >
> > is always much closer to the huge number. We're usually quite happy
> > when the estimates are within the same order of magnitude, so whether
> > it's K or K/2 makes pretty much no difference.
> >
> > I believe that Mark means the geometric mean, i.e. sqrt((small number) * (huge number)).

Ah, OK. I hadn't realized you meant the geometric mean. With that it
looks like this:

1) independent

                   10       50      100      500     1000     5000
------------------------------------------------------------------
actual            919     3829     6244     9944    10001    10001
current            10       50      102      516     1018     4996
new               973     4001     6382     9897     9951     9951
average           491     2025     3205     5206     5484     7473
geom. mean         99      447      807     2260     3183     7051

2) dependent

                   10       50      100      500     1000     5000
------------------------------------------------------------------
actual             10       50      100      500     1000     5000
current            10       53      105      508     1016     5014
new               880     4105     6472     9955    10018    10018
average           445     2079     3288     5231     5517     7516
geom. mean         94      466      824     2249     3190     7087
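
Just to spell out how the two derived rows are computed from the
'current' and 'new' estimates, here's a minimal C sketch using the
first column of the independent table (the small differences against
the table are just rounding):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    /* first column of the "independent" table: current = 10, new = 973 */
    double cur_est = 10;
    double new_est = 973;

    /* plain (arithmetic) average of the two estimates */
    double average = (cur_est + new_est) / 2.0;

    /* geometric mean of the two estimates */
    double geomean = sqrt(cur_est * new_est);

    printf("average    = %.1f\n", average);   /* 491.5, shown as 491 above */
    printf("geom. mean = %.1f\n", geomean);   /* ~98.6, shown as 99 above */

    return 0;
}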

To better illustrate the differences, we may use the "ratio error"
defined as

err(a,b) = MAX(a/b, b/a)

where 'a' is the actual value and 'b' is the estimate. That gives us:

1) independent

                   10       50      100      500     1000     5000
------------------------------------------------------------------
current         91.90    76.58    61.22    19.27     9.82     2.00
new              1.06     1.04     1.02     1.00     1.01     1.01
average          1.87     1.89     1.95     1.91     1.82     1.34
geom. mean       9.32     8.56     7.74     4.40     3.14     1.42

2) dependent

                   10       50      100      500     1000     5000
------------------------------------------------------------------
current          1.00     1.06     1.05     1.02     1.02     1.00
new             88.00    82.10    64.72    19.91    10.02     2.00
average         44.50    41.58    32.88    10.46     5.52     1.50
geom. mean       9.38     9.33     8.24     4.50     3.19     1.42
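
And the same kind of sketch for the ratio error itself, again using the
first column of the independent case (ratio_error() is just an
illustrative helper here, not anything from the backend):

#include <math.h>
#include <stdio.h>

/* ratio error as defined above: err(a,b) = MAX(a/b, b/a) */
static double
ratio_error(double actual, double estimate)
{
    return fmax(actual / estimate, estimate / actual);
}

int
main(void)
{
    /* actual number of groups in the first "independent" column */
    double actual = 919;

    printf("current    %6.2f\n", ratio_error(actual, 10));    /* 91.90 */
    printf("new        %6.2f\n", ratio_error(actual, 973));   /*  1.06 */
    printf("average    %6.2f\n", ratio_error(actual, 491));   /*  1.87 */

    /*
     * 9.28 with the rounded geometric mean (99); the table above uses
     * the unrounded value (~98.6), hence 9.32.
     */
    printf("geom. mean %6.2f\n", ratio_error(actual, 99));

    return 0;
}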

So the geometric mean does seem to work better than the plain average.
But I don't think we can simply conclude we should use the geometric
mean of the estimates, as that assumes the risks of over- and
under-estimating the cardinality are the same. And they aren't.

>
> Yes, that is what I proposed upthread. I'm not wedded to that, though.
> In general, I am with Tomas on this one, believing that his estimate
> will be much better than the current estimate. But I believe the *best*
> estimate will be somewhere between his and the current one, and I'm
> fishing for any decent way of coming up with a weighted average that
> is closer to his than to the current, but not simply equal to his proposal.
>
> The reason I want the formula to be closer to Tomas's than to the
> current is that I think that on average, across all tables, across all
> databases, in practice it will be closer to the right estimate than the
> current formula. That's just my intuition, and so I can't defend it.
> But if my intuition is right, the best formula we can adopt would be one
> that is moderated from his by a little bit, and in the direction of the
> estimate that the current code generates.

I think looking merely at what fraction of data sets (or queries) use
dependent GROUP BY and WHERE clauses is not sufficient to decide this.

The general behavior of the two GROUP BY estimators is that the current
one tends to under-estimate, while the new one tends to over-estimate.
(It's not difficult to construct counter-examples using more complex
dependencies between the columns etc., but let's ignore that.)

The risk associated with over-estimation is that we may switch from
HashAggregate to GroupAggregate, and generally select paths better
suited for a large number of rows. That's not great, because the plan
may not be optimal and may thus run much slower - but that usually only
happens on small tables, because on large tables we would probably end
up using those same paths anyway.

With under-estimation the risks are much graver, as we may end up using
HashAggregate only to get killed by OOM. And even when we're lucky
enough not to hit that, in my experience this often leads to a cascade
of NestedLoop nodes processing orders of magnitude more tuples than
expected. Awful.

So I think the risk is asymmetric, and that's why I like the new
estimator more, because it tends to over-estimate, but in a very
reasonable way.

>
> I could easily lose this debate merely for lack of a principled basis
> for saying how far toward the current estimate the new estimate should
> be adjusted. The geometric average is one suggestion, but I don't have
> a principled argument for it.
>
> Like I said above, I'm fishing for a decent formula here.

Thanks for the valuable feedback!

>
> Mark Dilger

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
