Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-07-13 11:19:18
Message-ID: CAEZATCVa-86yMzwVYVnrvfagO07YpXPR5X1uisvVasGHd3k-4A@mail.gmail.com
Lists: pgsql-hackers

On 24 June 2018 at 20:45, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Attached is a rebased version of this patch series, mostly just fixing
> the breakage caused by reworked format of initial catalog data.
>
> Aside from that, the MCV building now adopts the logic introduced by
> commit b5db1d93d2 for single-column MCV lists. The new algorithm seems
> pretty good and I don't see why multi-column MCV lists should use
> something special.

Agreed.

> I'm sure there are plenty of open questions to discuss, particularly
> stuff related to combining the various types of statistics to the final
> estimate (a lot of that was already improved based on Dean's reviews).

Yes, that's definitely one of the trickier parts of this. I don't
think that the current algorithm is ideal as it stands. In particular,
the way that it attempts to handle complex combinations of clauses
doesn't look right. I think mcv_clauselist_selectivity() and
histogram_clauselist_selectivity() are plausibly correct, but the way
that the resulting selectivities are combined in
statext_clauselist_selectivity() doesn't seem right. In particular,
the use of estimate_equality_groups() to count "nmatches" and
"fullmatch" only takes into account top-level equality clauses, so it
will fail to recognise other cases like (a=1 AND (b=1 OR b=2)) which
might be fully covered by, say, the MCV stats. Consider, for example,
the following simple test case:

create table foo(a int, b int);
insert into foo select 1,1 from generate_series(1,50000);
insert into foo select 1,2 from generate_series(1,40000);
insert into foo select 1,x/10 from generate_series(30,250000) g(x);
insert into foo select 2,1 from generate_series(1,30000);
insert into foo select 2,2 from generate_series(1,20000);
insert into foo select 2,x/10 from generate_series(30,500000) g(x);
insert into foo select 3,1 from generate_series(1,10000);
insert into foo select 3,2 from generate_series(1,5000);
insert into foo select 3,x from generate_series(3,600000) g(x);
insert into foo select x,x/10 from generate_series(4,750000) g(x);

create statistics foo_mcv_ab (mcv) on a,b from foo;
analyse foo;

explain analyse select * from foo where a=1 and b=1;
-- Actual rows: 50000, Estimated: 52690 (14149 without MV-stats)

explain analyse select * from foo where a=1 and b=2;
-- Actual rows: 40000, Estimated: 41115 (10534 without MV-stats)

explain analyse select * from foo where a=1 and (b=1 or b=2);
-- Actual rows: 90000, Estimated: 181091 (24253 without MV-stats)

explain analyse select * from foo where (a=1 or a=2) and (b=1 or b=2);
-- Actual rows: 140000, Estimated: 276425 (56716 without MV-stats)

In the first 2 queries the multivariate MCV stats help a lot and give
good estimates, but in the last 2 queries the estimates are around
twice as large as they should be, even though good MCV stats are
available on those specific values.

The tricky thing is to work out how to correctly combine the various
stats that are available. In the above examples, the selectivity
returned by mcv_clauselist_selectivity() would have been basically
correct, but since the match will not have been identified as a
fullmatch, and some non-equality clauses (the OR clauses) will have
been seen at the top level, it ends up adding on additional
selectivities from clauselist_selectivity().

I think perhaps it might be better not to attempt to combine the
*overall* selectivity returned by mcv_clauselist_selectivity() with
that returned by clauselist_selectivity(), but rather merge the logic
of these two functions together into a single traversal of the
clause-tree. That way, the various individual selectivities can be
combined on a clause-by-clause basis to give the best running total
based on the available information. Even when the multivariate stats
are incomplete, they may still provide a useful lower bound on the
selectivity. If/when all MCV columns have been matched exactly, that
lower bound might turn into the appropriate overall result, if there
is a matching MCV entry. For example, suppose that there are MCV stats
on 3 columns a,b,c and a WHERE clause like (a=1 AND b=2 AND c=3). You
might process that something like:

* Get sel(a=1) using the normal univariate stats
* Update the multivariate MCV stats match bitmap based on a=1
* Get sel(b=2) using the normal univariate stats
* Compute the total selectivity assuming independence:
total_sel = sel(a=1)*sel(b=2)
* Update the multivariate MCV stats match bitmap based on b=2
* Compute the multivariate MCV selectivity so far:
mcv_sel = Sum of MCV frequencies that match so far
* Use that as a lower bound:
total_sel = Max(total_sel, mcv_sel)
* Get sel(c=3) using the normal univariate stats
* Compute the new total selectivity assuming independence:
total_sel *= sel(c=3)
* Update the multivariate MCV stats match bitmap based on c=3
* Compute the new multivariate MCV selectivity:
mcv_sel = Sum of MCV frequencies that now match
* Use that as a new lower bound:
total_sel = Max(total_sel, mcv_sel)

so it becomes simpler to merge the selectivities, because it need only
worry about one clause at a time, and it still makes use of partial
information.

If there was no MCV entry for (a=1,b=2,c=3), it will still have made
use of any MCV frequencies for (a=1,b=2) to give a somewhat better
estimate, and it will have made use of any available univariate stats,
which might be better under some circumstances.

I think this approach generalises quite simply to arbitrary AND/OR
combinations, and as discussed before, I don't think that it needs to
handle NOT clauses except in the special case of a "NOT bool_var"
clause.

One drawback of this approach is that the result will depend on the
order the clauses are processed, but maybe that's OK, given that we
can't reasonably try all possible combinations.

> One thing that occurred to me while comparing the single-column logic (as
> implemented in selfuncs.c) and the new multi-column stuff, is dealing
> with partially-matching histogram buckets.
>
> In the single-column case, we pretty much assume uniform distribution in
> each bucket, and linearly interpolate the selectivity. So for a bucket
> with boundaries [0, 10] and condition "x <= 5" we return 0.5, for "x <
> 7" we return 0.7 and so on. This is what convert_to_scalar() does.
>
> In the multi-column case, we simply count each matching bucket as 0.5,
> without any attempts to linearly interpolate. It would not be difficult
> to call "convert_to_scalar" for each condition (essentially repeating
> the linear interpolation for each column), but then what? We could
> simply compute a product of those results, of course, but that only
> works assuming independence. And that's exactly the wrong thing to
> assume here, considering the extended statistics are meant for cases
> where the columns are not independent.
>
> So I'd argue the 0.5 estimate for partially-matching buckets is the
> right thing to do here, as it's minimizing the average error.

Hmm, that seems a bit dubious to me.

I think anything that tried to interpolate a value between 0 and the
bucket frequency ought to be better, at least in cases where nearly
all or nearly none of the bucket is matched. So then it just becomes a
question of how best to interpolate.

As you say, if the columns were independent, simply taking the product
would probably be right, but if the columns were fully dependent on
one another, it's not at all obvious what the best interpolation is,
because the bucket may be long and thin, with most of the data at one
end. However, in the absence of any other information, a reasonable
approach might be just to take the geometric mean -- i.e., the nth
root of the product.

So perhaps a reasonable interpolation algorithm would be to take the
product to some power, determined from an estimate of the degree of
dependence in the histogram. I think there's enough information in the
histogram data to get an estimate of that -- the bucket's size
relative to the total data extents vs the bucket frequency.
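
Just to illustrate the shape of that, here's a trivial standalone C
sketch (not patch code -- the "dependence" input is assumed to come
from a comparison of the bucket's size against its frequency, as
above, and all the names are hypothetical):

/*
 * Toy sketch: estimate the fraction of a partially-matched bucket from
 * the per-column fractions covered by the clauses (the same sort of
 * linear interpolation convert_to_scalar-style logic would produce).
 * "dependence" is an assumed 0..1 estimate of how correlated the
 * columns are within the bucket: 0 gives the plain product
 * (independence), 1 gives the geometric mean (full dependence).
 */
#include <math.h>
#include <stdio.h>

static double
bucket_fraction(const double *col_fraction, int ncols, double dependence)
{
    double      product = 1.0;

    for (int i = 0; i < ncols; i++)
        product *= col_fraction[i];

    /* exponent slides from 1 (independent) down to 1/ncols (dependent) */
    double      exponent = 1.0 - dependence * (1.0 - 1.0 / ncols);

    return pow(product, exponent);
}

int
main(void)
{
    /* e.g. "a <= 5" and "b < 7" in a bucket spanning [0,10] on each column */
    double      fractions[2] = {0.5, 0.7};

    printf("independent:     %.3f\n", bucket_fraction(fractions, 2, 0.0));
    printf("fully dependent: %.3f\n", bucket_fraction(fractions, 2, 1.0));
    return 0;
}

With per-column fractions of 0.5 and 0.7 that gives 0.35 under
independence and about 0.59 under full dependence, with intermediate
dependence estimates falling in between -- which at least seems more
plausible than a flat 0.5 regardless of how much of the bucket the
clauses cover.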

On a different note, reading another recent thread [1] made me realise
there's another thing this patch needs to worry about -- the new code
needs to be calling statistic_proc_security_check() to determine
whether it's OK to be using these statistics -- cf. commit
e2d4ef8de8.
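
For reference, the decision that statistic_proc_security_check() makes
boils down to something like the following (a standalone paraphrase,
not the actual selfuncs.c code), and the new MCV code would presumably
want the same kind of gate before feeding stats values to an operator:

#include <stdbool.h>

/*
 * Paraphrase of the check: only pass statistics values to a function
 * if the user could have read the underlying data anyway, or if the
 * function is marked leakproof and so cannot reveal the values.
 */
static bool
ok_to_use_stats_values(bool user_can_read_all_columns,
                       bool func_is_leakproof)
{
    if (user_can_read_all_columns)
        return true;            /* nothing to hide from this user */
    if (func_is_leakproof)
        return true;            /* a leakproof function can't leak values */
    return false;               /* otherwise keep the stats values private */
}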

Similarly, I think that access to pg_statistic_ext should be
restricted in the same way that access to pg_statistic is, with a
security-barrier view on top. It's probably OK as it is now with just
the ndistinct and dependency degree stats, since they don't reveal
actual data values, but the addition of MCV stats changes that.

That's it for now. I hope some of that was useful.

Regards,
Dean

[1] https://www.postgresql.org/message-id/3876.1531261875%40sss.pgh.pa.us
