From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-07-17 13:03:28
Message-ID: c2614611-fd99-124f-c9aa-0fef3859c78f@2ndquadrant.com
Lists: pgsql-hackers

On 07/17/2018 11:09 AM, Dean Rasheed wrote:
> On 16 July 2018 at 21:55, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
...
>>
>> So, how would the proposed algorithm work? Let's start with "a=1":
>>
>> sel(a=1) = 0.1508
>>
>> I don't see much point in applying the two "b" clauses independently (or
>> how would it be done, as it's effectively a single clause):
>>
>> sel(b=1 or b=2) = 0.0673
>>
>> And we get
>>
>> total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101
>>
>> From the multivariate MCV we get
>>
>> mcv_sel = 0.0399
>>
>> And finally
>>
>> total_sel = Max(total_sel, mcv_sel) = 0.0399
>>
>> Which is great, but I don't see how that actually helped anything? We
>> still only have the estimate for the ~7% covered by the MCV list, and
>> not the remaining non-MCV part.
>>
>
> Right. If these are the only stats available, and there are just 2
> top-level clauses like this, it either returns the MCV estimate, or
> the old univariate estimate (whichever is larger). It avoids
> over-estimating, but will almost certainly under-estimate when the MCV
> matches are incomplete.
>

Yeah :-(

In my experience, under-estimates tend to have much worse consequences
(say, a nested loop chosen because of an under-estimate vs. a hash join
chosen because of an over-estimate). This certainly influenced some of
the choices I've made in this patch (the extrapolation to the non-MCV
part is an example of that), but I agree it's not a particularly
scientific approach, and I'd very much like something better.
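
Just to spell that out, here's the combination step from the example
above as a trivial C sketch (variable names are mine, this is not code
from the patch):

#include <stdio.h>

#define Max(a, b) ((a) > (b) ? (a) : (b))

int
main(void)
{
    double sel_a   = 0.1508;    /* univariate sel(a=1) */
    double sel_b   = 0.0673;    /* univariate sel(b=1 OR b=2) */
    double mcv_sel = 0.0399;    /* from the multivariate MCV list */

    /* univariate estimates combined under independence */
    double total_sel = sel_a * sel_b;       /* ~0.0101 */

    /* take the larger of the two -- avoids over-estimates, but
     * under-estimates whenever the MCV matches are incomplete */
    total_sel = Max(total_sel, mcv_sel);    /* 0.0399 */

    printf("total_sel = %.4f\n", total_sel);
    return 0;
}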

>
>> I could do the same thing for the second query, but the problem there is
>> actually exactly the same - extrapolation from MCV to non-MCV part
>> roughly doubles the estimate.
>>
>> So unless I'm applying your algorithm incorrectly, this does not seem
>> like a very promising direction :-(
>>
>
> You could be right. Actually it's the order dependence with more than
> 2 top-level clauses that bothers me most about this algorithm. It's
> also not entirely obvious how to include histogram stats in this
> scheme.
>

I think for inequalities that's fairly simple - histograms work pretty
well for those, and I have a hunch that replacing the 0.5 estimate for
partially-matching buckets with convert_to_scalar-like logic and the
geometric mean (as you proposed) will work well enough.
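
Something like this, perhaps (just a sketch of how I read the proposal;
the per-clause fractions would come from the convert_to_scalar-like
interpolation within the bucket):

#include <math.h>

/*
 * Sketch only: selectivity contribution of one histogram bucket for a
 * set of inequality clauses. Instead of the flat 0.5 for a partially
 * matching bucket, use the geometric mean of the per-clause fractions
 * of the bucket covered by each clause.
 */
static double
bucket_selectivity(double bucket_freq, const double *clause_frac,
                   int nclauses)
{
    double  product = 1.0;
    int     i;

    for (i = 0; i < nclauses; i++)
        product *= clause_frac[i];

    /* geometric mean of the per-clause fractions */
    return bucket_freq * pow(product, 1.0 / nclauses);
}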

For equalities it's going to be hard. The only thing I can think of at
the moment is checking whether there are any matching histogram buckets
at all, and using that to decide whether to extrapolate the MCV
selectivity to the non-MCV part or not (or perhaps to which part of the
non-MCV part).
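
Roughly this (all names are made up, just to illustrate the idea):

#include <stdbool.h>

/*
 * Sketch only: extrapolate the match rate from the MCV part to the
 * non-MCV part only if the histogram has at least one bucket matching
 * the equality clauses, i.e. only if matching rows can plausibly exist
 * outside the MCV list (assumes mcv_totalfreq > 0).
 */
static double
equality_estimate(double mcv_sel,       /* matched within MCV items */
                  double mcv_totalfreq, /* total frequency of MCV items */
                  bool bucket_matches)  /* any matching histogram bucket? */
{
    if (!bucket_matches)
        return mcv_sel;     /* assume the MCV list covers all matches */

    /* assume the non-MCV part matches at the same rate as the MCV part */
    return mcv_sel + (mcv_sel / mcv_totalfreq) * (1.0 - mcv_totalfreq);
}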

> A different approach that I have been thinking about is, in
> mcv_update_match_bitmap(), attempt to work out the probability of all
> the clauses matching and it not being an MCV value. For example, given
> a clause like a=1 whose univariate estimate we know (0.1508 in the
> above example), knowing that the MCV values account for 0.0222+0.0177
> of that, the remainder must be from non-MCV values. So we could say
> that the probability of a=1 and it not being an MCV is
> 0.1508-0.0222-0.0177 = 0.1109. So then the question is could we
> combine these non-MCV probabilities to give an estimate of how many
> non-MCV values we need to worry about? I've not fully thought that
> through, but it might be useful.

Could we use it to derive an upper bound on the non-MCV part?
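
E.g. something like this (a sketch of the arithmetic in your example;
the Min() observation at the end is just basic probability):

/*
 * Probability of a clause matching AND the value not being in the
 * MCV list: the univariate estimate minus the frequencies of the
 * matching MCV items.
 */
static double
non_mcv_part(double univariate_sel, double matched_mcv_freq)
{
    return univariate_sel - matched_mcv_freq;
}

/*
 * For "a=1": non_mcv_part(0.1508, 0.0222 + 0.0177) = 0.1109. Each
 * per-clause remainder is by itself an upper bound on P(all clauses
 * match AND the value is not an MCV), so Min() over the clauses caps
 * the non-MCV part.
 */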

> The problem is, it's still likely to
> over-estimate when the MCV matches fully cover all possibilities, and
> I still don't see a way to reliably detect that case.
>

I guess we can use the histogram to limit the over-estimates, as
explained above. It may not be 100% reliable, as it depends on the
sample and on how exactly the buckets are formed, but it might help.

But perhaps these over-estimates are not a particularly serious issue?
When you already get matches in an MCV list, the number of matching
rows is going to be pretty significant. If you over-estimate by 2x, so
what? IMHO that's still a pretty accurate estimate.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
