Re: improving GROUP BY estimation

From: Mark Dilger <hornschnorter(at)gmail(dot)com>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: improving GROUP BY estimation
Date: 2016-03-03 19:42:58
Message-ID: E00A6146-D40C-4E4E-92B2-D6F5017F9B19@gmail.com
Lists: pgsql-hackers


> On Mar 3, 2016, at 11:27 AM, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>
> On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> So yes, each estimator works great for exactly the opposite cases. But notice that typically, the result of the new formula is much higher than the old one, sometimes by two orders of magnitude (and it shouldn't be difficult to construct examples of much higher differences).
>
> The table also includes the 'average' estimator you propose, but it's rather obvious that the result is always much closer to the new value, simply because
>
> (small number) + (huge number)
> ------------------------------
> 2
>
> is always much closer to the huge number. We're usually quite happy when the estimates are within the same order of magnitude, so whether it's K or K/2 makes pretty much no difference.
>
> I believe that Mark means the geometric average, i.e. sqrt((small number) * (huge number)).

Yes, that is what I proposed upthread. I'm not wedded to that, though.
In general, I am with Tomas on this one, believing that his estimate
will be much better than the current one. But I believe the *best*
estimate lies somewhere between his and the current one, and I'm
fishing for any decent weighted average that is closer to his than
to the current estimate, without being simply equal to his proposal.
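
To illustrate with made-up numbers: if the current formula says 100
groups and the new one says 100,000, the arithmetic mean is 50,050,
still the same order of magnitude as the new estimate, while the
geometric mean, sqrt(100 * 100000), is about 3,162, which is a genuine
compromise on the log scale.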

The reason I want the formula to be closer to Tomas's than to the
current one is that I think that on average, across all tables in all
databases, it will in practice be closer to the right estimate than
the current formula is. That's just my intuition, so I can't defend it.
But if my intuition is right, the best formula we can adopt is one
that is moderated from his by a little bit, in the direction of the
estimate that the current code generates.

I could easily lose this debate merely for lack of a principled basis
for saying how far toward the current estimate the new estimate should
be adjusted. The geometric average is one suggestion, but I don't have
a principled argument for it.

Like I said above, I'm fishing for a decent formula here.
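
For concreteness, here is a minimal sketch of the kind of thing I have
in mind (the function names and the 3:1 weighting are purely
illustrative, not actual PostgreSQL code):

    #include <math.h>

    /*
     * Blend the current estimate and the new one with a geometric
     * mean.  On a log scale this lands halfway between the two, so
     * a 100x disagreement becomes a 10x adjustment either way.
     */
    double
    blended_ndistinct(double current_est, double new_est)
    {
        return sqrt(current_est * new_est);
    }

    /*
     * A weighted variant that leans toward the new estimate, as I
     * suggested above; the 0.25/0.75 split is arbitrary.
     */
    double
    weighted_ndistinct(double current_est, double new_est)
    {
        return pow(current_est, 0.25) * pow(new_est, 0.75);
    }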

Mark Dilger
