Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-07-15 09:36:11
Message-ID: CAEZATCUtvWzQ8BPgNcY8u6tnyAf6cvVoVC6RRdqbMuDNHfuS2g@mail.gmail.com
Lists: pgsql-hackers

On 13 July 2018 at 18:27, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> I'm not so sure. The issue is that a lot of the MCV deductions depend
> on whether we can answer questions like "Is there a single match?" or
> "If we got a match in MCV, do we need to look at the non-MCV part?" This
> is not very different from the single-column estimates, except of course
> here we need to look at multiple columns.
>
> The top-level clauses allow us to make such deductions, with deeper
> clauses it's much more difficult (perhaps impossible). Because for
> example with (a=1 AND b=1) there can be just a single match, so if we
> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
> b=2)) it's not that simple, because there may be multiple combinations
> and so a match in MCV does not guarantee anything.

Actually, it guarantees a lower bound on the overall selectivity, and
maybe that's the best that we can do in the absence of any other
stats.

I did wonder if maybe we could do better by tracking allowed value
counts. E.g., with a clause like ((a=1 OR a=2) AND (b=1 OR b=2)) it
would be fairly simple to see that there are 2 allowed values of a,
and 2 allowed values of b, so 4 allowed values overall. If we had,
say, 3 MCV matches, we'd then know to factor in something extra for
the 1 non-MCV match. I'm not sure what to do with non-equality clauses
though.
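
Just to illustrate the kind of counting I mean, here's a toy sketch
(standalone C; the clause representation and count_allowed_values()
are made up for this example, not anything from the patch):

    /*
     * Toy model of a clause of the form
     *   (a = v1 OR a = v2 OR ...) AND (b = w1 OR b = w2 OR ...) AND ...
     * We only record how many values each column is allowed to take.
     */
    #include <stdio.h>

    #define MAX_COLUMNS 8

    typedef struct
    {
        int     ncolumns;               /* number of ANDed column restrictions */
        int     nvalues[MAX_COLUMNS];   /* allowed values per column */
    } ToyClause;

    /* Total allowed combinations: the product of the per-column counts */
    static long
    count_allowed_values(const ToyClause *clause)
    {
        long    total = 1;

        for (int i = 0; i < clause->ncolumns; i++)
            total *= clause->nvalues[i];

        return total;
    }

    int
    main(void)
    {
        /* ((a=1 OR a=2) AND (b=1 OR b=2)): 2 x 2 = 4 allowed combinations */
        ToyClause   clause = {2, {2, 2}};
        long        allowed = count_allowed_values(&clause);
        long        mcv_matches = 3;    /* suppose 3 of them match MCV items */

        printf("allowed = %ld, in MCV = %ld, needing a non-MCV estimate = %ld\n",
               allowed, mcv_matches, allowed - mcv_matches);
        return 0;
    }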

>> I think perhaps it might be better not to attempt to combine the
>> *overall* selectivity returned by mcv_clauselist_selectivity() with
>> that returned by clauselist_selectivity(), but rather merge the logic
>> of these two functions together into a single traversal of the
>> clause-tree. That way, the various individual selectivities can be
>> combined on a clause-by-clause basis to give the best running total
>> based on the available information. Even when the multivariate stats
>> are incomplete, they may still provide a useful lower bound on the
>> selectivity.
>
> I don't follow. The example you presented above showed multivariate
> stats producing over-estimates, so how would it be helpful to use that
> as lower boundary for anything?

No, the multivariate MCV stats were producing good estimates, even for
the complex clauses, because they were all common values in my
example. The problem was that the good MCV estimate was then being
ruined by adding on extra factors because at the top-level it didn't
appear to be a full match.

>> If/when all MCV columns have been matched exactly, that
>> lower bound might turn into the appropriate overall result, if there
>> is a matching MCV entry.
>
> Isn't the problem illustrated by the examples that we don't know if the
> MCV matches represent all matches, or if there may be matches in the
> histogram?

The example illustrated a case where the MCV matches represented all
the matches, but we failed to recognise that. Now we could fix that to
reliably detect cases where the MCV matches represent all the matches,
but it's still not entirely obvious what to do when they don't.

What I'm considering is an algorithm where we simultaneously compute 3 things:

simple_sel - The result we would get without multivariate stats (*)
mcv_sel - The multivariate MCV result
hist_sel - The multivariate histogram result

(*) except that at each stage where we add a new clause to the
simple_sel value, we improve upon that estimate by factoring in a
lower bound from the multivariate stats so far, so that even if the
multivariate stats fail to generate anything at the end, we've managed
to account for some of the non-independence of the columns.

If the MCV matches represented all the matches, then at the end we
would have simple_sel = mcv_sel, and hist_sel = 0, and we'd be done.

Otherwise, we'd have simple_sel >= mcv_sel, and a possibly non-zero
hist_sel, but if we had managed to factor in both mcv_sel and hist_sel
to simple_sel at each stage as we went along, then simple_sel is the
best overall estimate we can return.
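
To sketch the control flow (toy standalone C, with made-up numbers
standing in for the per-clause and running multivariate estimates;
it's only meant to show where the clamping would happen, not actual
estimator code):

    #include <stdio.h>

    #define NCLAUSES 3

    int
    main(void)
    {
        /* stand-ins for what the per-clause simple estimates would be */
        double  clause_sel[NCLAUSES] = {0.10, 0.20, 0.50};

        /* stand-ins for the running multivariate MCV / histogram matches */
        double  mcv_running[NCLAUSES] = {0.08, 0.05, 0.04};
        double  hist_running[NCLAUSES] = {0.01, 0.00, 0.00};

        double  simple_sel = 1.0;

        for (int i = 0; i < NCLAUSES; i++)
        {
            /* independence-based estimate for the clauses so far */
            simple_sel *= clause_sel[i];

            /*
             * The multivariate stats matched so far give a lower bound on
             * the selectivity of those clauses, so clamp simple_sel up to
             * it, even if the multivariate stats never produce a complete
             * estimate on their own.
             */
            double  mv_lower_bound = mcv_running[i] + hist_running[i];

            if (simple_sel < mv_lower_bound)
                simple_sel = mv_lower_bound;
        }

        printf("final estimate: %g\n", simple_sel);
        return 0;
    }

With those numbers the clamping fires on the second and third clauses,
and the final estimate ends up at 0.04 rather than the 0.01 that the
independence assumption alone would give.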

Perhaps this is not so very different from what you're currently
doing, except that with this approach we might also end up with
mcv_sel = 0 and hist_sel = 0, but still have an improved simple_sel
estimate that had accounted for some of the multivariate stats
available along the way.

Regards,
Dean
