Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-07-13 17:27:51
Message-ID: d6e5e15d-3f53-239a-7064-bb99f037a0da@2ndquadrant.com
Lists: pgsql-hackers

On 07/13/2018 01:19 PM, Dean Rasheed wrote:
> On 24 June 2018 at 20:45, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Attached is a rebased version of this patch series, mostly just fixing
>> the breakage caused by reworked format of initial catalog data.
>>
>> Aside from that, the MCV building now adopts the logic introduced by
>> commit b5db1d93d2 for single-column MCV lists. The new algorithm seems
>> pretty good and I don't see why multi-column MCV lists should use
>> something special.
>
> Agreed.
>
>
>> I'm sure there are plenty of open questions to discuss, particularly
>> stuff related to combining the various types of statistics to the final
>> estimate (a lot of that was already improved based on Dean's reviews).
>
> Yes, that's definitely one of the trickier parts of this. I don't
> think that the current algorithm is ideal as it stands. In particular,
> the way that it attempts to handle complex combinations of clauses
> doesn't look right. I think mcv_clauselist_selectivity() and
> histogram_clauselist_selectivity() are plausibly correct, but the way
> that the resulting selectivities are combined in
> statext_clauselist_selectivity() doesn't seem right. In particular,
> the use of estimate_equality_groups() to count "nmatches" and
> "fullmatch" only takes into account top-level equality clauses, so it
> will fail to recognise other cases like (a=1 AND (b=1 OR b=2)) which
> might be fully covered by, say, the MCV stats. Consider, for example,
> the following simple test case:
>
>
> create table foo(a int, b int);
> insert into foo select 1,1 from generate_series(1,50000);
> insert into foo select 1,2 from generate_series(1,40000);
> insert into foo select 1,x/10 from generate_series(30,250000) g(x);
> insert into foo select 2,1 from generate_series(1,30000);
> insert into foo select 2,2 from generate_series(1,20000);
> insert into foo select 2,x/10 from generate_series(30,500000) g(x);
> insert into foo select 3,1 from generate_series(1,10000);
> insert into foo select 3,2 from generate_series(1,5000);
> insert into foo select 3,x from generate_series(3,600000) g(x);
> insert into foo select x,x/10 from generate_series(4,750000) g(x);
>
> create statistics foo_mcv_ab (mcv) on a,b from foo;
> analyse foo;
>
> explain analyse select * from foo where a=1 and b=1;
> -- Actual rows: 50000, Estimated: 52690 (14149 without MV-stats)
>
> explain analyse select * from foo where a=1 and b=2;
> -- Actual rows: 40000, Estimated: 41115 (10534 without MV-stats)
>
> explain analyse select * from foo where a=1 and (b=1 or b=2);
> -- Actual rows: 90000, Estimated: 181091 (24253 without MV-stats)
>
> explain analyse select * from foo where (a=1 or a=2) and (b=1 or b=2);
> -- Actual rows: 140000, Estimated: 276425 (56716 without MV-stats)
>
>
> In the first 2 queries the multivariate MCV stats help a lot and give
> good estimates, but in the last 2 queries the estimates are around
> twice as large as they should be, even though good MCV stats are
> available on those specific values.
>

I'm not so sure. The issue is that a lot of the MCV deductions depend
on whether we can answer questions like "Is there a single match?" or
"If we got a match in the MCV list, do we need to look at the non-MCV
part at all?" This is not very different from the single-column
estimates, except that here we need to look at multiple columns.

Top-level clauses allow us to make such deductions; with deeper clauses
it's much more difficult (perhaps impossible). For example, with (a=1
AND b=1) there can be at most a single matching combination, so if we
find it in the MCV list we're done. With clauses like ((a=1 OR a=2) AND
(b=1 OR b=2)) it's not that simple, because multiple combinations may
match, and so a hit in the MCV list doesn't guarantee we've seen all
the matches.

I don't think there's a way around this. The single-dimensional case
does not have this issue, of course.
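
To make this concrete, here is a toy sketch of the deduction I have in
mind (this is not the patch code; combine_mcv_estimate and all the
numbers are invented for illustration). Once we know whether the clause
list was a full top-level equality match and how much MCV frequency it
matched, combining the MCV and non-MCV parts is easy; without the
fullmatch guarantee we can't skip the non-MCV part:

/*
 * Toy sketch only - the function and the numbers below are made up for
 * illustration; this is not how the patch structures the code.
 */
#include <stdbool.h>
#include <stdio.h>

static double
combine_mcv_estimate(bool fullmatch, double mcv_match_freq,
                     double mcv_total_freq, double other_estimate)
{
    /*
     * A full equality match can hit at most one (a,b,...) combination;
     * if that combination is in the MCV list, its frequency is the
     * whole answer and the non-MCV part contributes nothing.
     */
    if (fullmatch && mcv_match_freq > 0)
        return mcv_match_freq;

    /*
     * Otherwise matches may also hide in the non-MCV part, so scale
     * some fallback estimate to the non-MCV fraction and add the MCV
     * part.
     */
    return mcv_match_freq + (1.0 - mcv_total_freq) * other_estimate;
}

int
main(void)
{
    /* a=1 AND b=1, found in the MCV list: the MCV frequency is final */
    printf("fullmatch:     %.4f\n",
           combine_mcv_estimate(true, 0.05, 0.60, 0.02));

    /* (a=1 OR a=2) AND (b=1 OR b=2): the MCV hit is only partial */
    printf("not fullmatch: %.4f\n",
           combine_mcv_estimate(false, 0.14, 0.60, 0.06));

    return 0;
}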

> The tricky thing is to work out how to correctly combine the various
> stats that are available. In the above examples, the selectivity
> returned by mcv_clauselist_selectivity() would have been basically
> correct, but since it will not have been identified as a fullmatch and
> some non-equality clauses will have been seen at the top-level (the OR
> clauses), it ends up adding on additional selectivities from
> clauselist_selectivity().
>
> I think perhaps it might be better not to attempt to combine the
> *overall* selectivity returned by mcv_clauselist_selectivity() with
> that returned by clauselist_selectivity(), but rather merge the logic
> of these two functions together into a single traversal of the
> clause-tree. That way, the various individual selectivities can be
> combined on a clause-by-clause basis to give the best running total
> based on the available information. Even when the multivariate stats
> are incomplete, they may still provide a useful lower bound on the
> selectivity.

I don't follow. The example you presented above showed the multivariate
stats producing over-estimates, so how would it help to use them as a
lower bound for anything?

> If/when all MCV columns have been matched exactly, that
> lower bound might turn into the appropriate overall result, if there
> is a matching MCV entry.

Isn't the problem illustrated by the examples precisely that we don't
know whether the MCV matches represent all the matches, or whether
there may be further matches in the histogram?

> For example, suppose that there are MCV stats
> on 3 columns a,b,c and a WHERE clause like (a=1 AND b=2 AND c=3). You
> might process that something like:
>
> * Get sel(a=1) using the normal univariate stats
> * Update the multivariate MCV stats match bitmap based on a=1
> * Get sel(b=2) using the normal univariate stats
> * Compute the total selectivity assuming independence:
> total_sel = sel(a=1)*sel(b=2)
> * Update the multivariate MCV stats match bitmap based on b=2
> * Compute the multivariate MCV selectivity so far:
> mcv_sel = Sum of MCV frequencies that match so far
> * Use that as a lower bound:
> total_sel = Max(total_sel, mcv_sel)
> * Get sel(c=3) using the normal univariate stats
> * Compute the new total selectivity assuming independence:
> total_sel *= sel(c=3)
> * Update the multivariate MCV stats match bitmap based on c=3
> * Compute the new multivariate MCV selectivity:
> mcv_sel = Sum of MCV frequencies that now match
> * Use that as a new lower bound:
> total_sel = Max(total_sel, mcv_sel)
>
> so it becomes simpler to merge the selectivities, because it need only
> worry about one clause at a time, and it still makes use of partial
> information.
>

I'm not sure this makes it any simpler - it's pretty much what we do
now, updating the bitmaps clause by clause.

We can probably make better use of the univariate estimates, using them
to derive upper/lower bounds in various places (because the
multivariate stats are generally coarser than the univariate ones).
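
For illustration, here is a rough standalone sketch of that
clause-by-clause combination, with the frequencies of still-matching
MCV items used as a running lower bound, roughly following the steps
you outlined above (this is not the patch code or your code; all names
and numbers are invented):

#include <stdio.h>

#define NCLAUSES 3              /* a=1, b=2, c=3 */
#define NMCV     4

typedef struct
{
    double freq;                /* frequency of this MCV item */
    int    matches[NCLAUSES];   /* does clause i match this item? */
} McvItem;

int
main(void)
{
    /* univariate selectivities of a=1, b=2, c=3 (invented) */
    double  clause_sel[NCLAUSES] = {0.30, 0.20, 0.10};

    /* invented MCV items with per-clause match flags */
    McvItem mcv[NMCV] = {
        {0.04, {1, 1, 1}},
        {0.03, {1, 1, 0}},
        {0.02, {1, 0, 1}},
        {0.05, {0, 1, 1}},
    };
    int     still_matches[NMCV] = {1, 1, 1, 1};
    double  total_sel = 1.0;

    for (int c = 0; c < NCLAUSES; c++)
    {
        double  mcv_sel = 0.0;

        /* extend the running estimate assuming independence */
        total_sel *= clause_sel[c];

        /* update the match bitmap, sum the still-matching MCV items */
        for (int i = 0; i < NMCV; i++)
        {
            still_matches[i] &= mcv[i].matches[c];
            if (still_matches[i])
                mcv_sel += mcv[i].freq;
        }

        /* the surviving MCV frequency is a lower bound */
        if (mcv_sel > total_sel)
            total_sel = mcv_sel;

        printf("after clause %d: total_sel = %.4f\n", c + 1, total_sel);
    }

    return 0;
}

With these made-up numbers the lower bound takes over from the second
clause on, which is the effect you describe.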

> If there was no MCV entry for (a=1,b=2,c=3), it will still have made
> use of any MCV frequencies for (a=1,b=2) to give a somewhat better
> estimate, and it will have made use of any available univariate stats,
> which might be better under some circumstances.
>

IMHO it's quite dangerous to use the MCV list like this. The values
that get into MCV lists are usually somewhat exceptional, and using
them to estimate the probability distribution of the non-MCV part is
tricky.

> I think this approach generalises quite simply to arbitrary AND/OR
> combinations, and as discussed before, I don't think that it needs to
> handle NOT clauses except in the special case of a "NOT bool_var"
> clause.
>
> One drawback of this approach is that the result will depend on the
> order the clauses are processed, but maybe that's OK, given that we
> can't reasonably try all possible combinations.
>
>
>> On thing that occurred to me while comparing the single-column logic (as
>> implemented in selfuncs.c) and the new multi-column stuff, is dealing
>> with partially-matching histogram buckets.
>>
>> In the single-column case, we pretty much assume uniform distribution in
>> each bucket, and linearly interpolate the selectivity. So for a bucket
>> with boundaries [0, 10] and condition "x <= 5" we return 0.5, for "x <
>> 7" we return 0.7 and so on. This is what convert_to_scalar() does.
>>
>> In the multi-column case, we simply count each matching bucket as 0.5,
>> without any attempts to linearly interpolate. It would not be difficult
>> to call "convert_to_scalar" for each condition (essentially repeating
>> the linear interpolation for each column), but then what? We could
>> simply compute a product of those results, of course, but that only
>> works assuming independence. And that's exactly the wrong thing to
>> assume here, considering the extended statistics are meant for cases
>> where the columns are not independent.
>>
>> So I'd argue the 0.5 estimate for partially-matching buckets is the
>> right thing to do here, as it's minimizing the average error.
>
> Hmm, that seems a bit dubious to me.
>
> I think anything that tried to interpolate a value between 0 and the
> bucket frequency ought to be better, at least in cases where nearly
> all or nearly none of the bucket is matched. So then it just becomes a
> question of how best to interpolate.
>
> As you say, if the columns were independent, simply taking the product
> would probably be right, but if the columns were fully dependent on
> one another, it's not at all obvious what the best interpolation is,
> because the bucket may be long and thin, with most of the data at one
> end. However, in the absence of any other information, a reasonable
> approach might be just to take the geometric mean -- i.e., the nth
> root of the product.
>
> So perhaps a reasonable interpolation algorithm would be to take the
> product to some power, determined from an estimate of the degree of
> dependence in the histogram. I think there's enough information in the
> histogram data to get an estimate of that -- the bucket's size
> relative to the total data extents vs the bucket frequency.
>

That's an interesting idea. I'll explore doing something like that.
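
Just to have the options side by side, here's a quick comparison for a
partially matched two-column bucket, using invented per-column
fractions (again not the patch code):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    /*
     * Invented example: along column a the condition covers 70% of the
     * bucket's range, along column b it covers 50% - the per-column
     * fractions a convert_to_scalar-style interpolation would give us.
     */
    double  frac[] = {0.7, 0.5};
    int     ndims = 2;
    double  product = 1.0;

    for (int i = 0; i < ndims; i++)
        product *= frac[i];

    printf("flat 0.5:              0.5000\n");
    printf("independence product:  %.4f\n", product);
    printf("geometric mean:        %.4f\n", pow(product, 1.0 / ndims));

    return 0;
}

Raising the product to a power somewhere between 1/ndims and 1, picked
from an estimate of how dependent the columns are within the histogram,
would then interpolate between the fully-dependent and independent
cases, as you suggest.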

>
> On a different note, reading another recent thread [1] made me realise
> there's another thing this patch needs to worry about -- the new code
> needs to be calling statistic_proc_security_check() to determine
> whether it's OK to be using these statistics -- c.f. commit
> e2d4ef8de8.
>
> Similarly, I think that access to pg_statistic_ext should be
> restricted in the same way that access to pg_statistic is, with a SB
> view on top. It's probably OK as it is now with just ndistinct and
> dependency degree stats, since they don't reveal actual data values,
> but the addition of MCV stats changes that.
>

Phew! Who needs security? ;-)

>
> That's it for now. I hope some of that was useful.
>

Certainly. Thanks for sharing your thoughts.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
