Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-07-17 09:09:15
Message-ID: CAEZATCX04NOAqJm7+vb_jyrMxDo44Pzx8xd_z5rRH7snsLqxEA@mail.gmail.com
Lists: pgsql-hackers

On 16 July 2018 at 21:55, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>
> On 07/16/2018 02:54 PM, Dean Rasheed wrote:
>> On 16 July 2018 at 13:23, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>>> The top-level clauses allow us to make such deductions; with deeper
>>>>> clauses it's much more difficult (perhaps impossible). For example,
>>>>> with (a=1 AND b=1) there can be just a single match, so if we find it
>>>>> in the MCV list we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
>>>>> b=2)) it's not that simple, because there may be multiple combinations
>>>>> and so a match in the MCV list does not guarantee anything.
>>>>
>>>> Actually, it guarantees a lower bound on the overall selectivity, and
>>>> maybe that's the best that we can do in the absence of any other
>>>> stats.
>>>>
>>> Hmmm, is that actually true? Let's consider a simple example, with two
>>> columns, each with just 2 values, and a "perfect" MCV list:
>>>
>>> a | b | frequency
>>> -------------------
>>> 1 | 1 | 0.5
>>> 2 | 2 | 0.5
>>>
>>> And let's estimate sel(a=1 & b=2).
>>
>> OK. In this case, there are no MCV matches, so there is no lower bound (it's 0).
>>
>> What we could do though is also impose an upper bound, based on the
>> sum of non-matching MCV frequencies (the selectivity can be at most 1
>> minus that sum). In this case, the upper bound is also 0, so we could
>> actually say the resulting selectivity is 0.
>>
>
> Hmmm, it's not very clear to me how we would decide which of these cases
> applies, because in most cases we don't have an MCV list covering 100% of the rows.
>
> Anyway, I've been thinking about how to modify the code to work the way
> you proposed (in a way sufficient for a PoC). But after struggling with
> it for a while, it occurred to me it might be useful to do it on paper
> first, to verify how it would work on your examples.
>
> So let's use this data
>
> create table foo(a int, b int);
> insert into foo select 1,1 from generate_series(1,50000);
> insert into foo select 1,2 from generate_series(1,40000);
> insert into foo select 1,x/10 from generate_series(30,250000) g(x);
> insert into foo select 2,1 from generate_series(1,30000);
> insert into foo select 2,2 from generate_series(1,20000);
> insert into foo select 2,x/10 from generate_series(30,500000) g(x);
> insert into foo select 3,1 from generate_series(1,10000);
> insert into foo select 3,2 from generate_series(1,5000);
> insert into foo select 3,x from generate_series(3,600000) g(x);
> insert into foo select x,x/10 from generate_series(4,750000) g(x);
>
> Assuming we have perfectly exact statistics, we have these MCV lists
> (both univariate and multivariate):
>
> select a, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by a order by 2 desc;
>
>    a    | count  | frequency
> --------+--------+-----------
>       3 | 614998 |    0.2727
>       2 | 549971 |    0.2439
>       1 | 339971 |    0.1508
>    1014 |      1 |    0.0000
>   57220 |      1 |    0.0000
> ...
>
> select b, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by b order by 2 desc;
>
>    b    | count | frequency
> --------+-------+-----------
>       1 | 90010 |    0.0399
>       2 | 65010 |    0.0288
>       3 |    31 |    0.0000
>       7 |    31 |    0.0000
> ...
>
> select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by a, b order by 3 desc;
>
>    a    |   b    | count | frequency
> --------+--------+-------+-----------
>       1 |      1 | 50000 |    0.0222
>       1 |      2 | 40000 |    0.0177
>       2 |      1 | 30000 |    0.0133
>       2 |      2 | 20000 |    0.0089
>       3 |      1 | 10000 |    0.0044
>       3 |      2 |  5000 |    0.0022
>       2 |  12445 |    10 |    0.0000
> ...
>
> And let's estimate the two queries with complex clauses, where the
> multivariate stats gave 2x overestimates.
>
> SELECT * FROM foo WHERE a=1 and (b=1 or b=2);
> -- actual 90000, univariate: 24253, multivariate: 181091
>
> univariate:
>
> sel(a=1) = 0.1508
> sel(b=1) = 0.0399
> sel(b=2) = 0.0288
> sel(b=1 or b=2) = 0.0673
>
> multivariate:
> sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770)
>
> The second multivariate estimate comes from assuming only the first 5
> items make it to the multivariate MCV list (covering 6.87% of the data)
> and extrapolating the selectivity to the non-MCV data too.
>
> (Notice it's about 2x the actual selectivity, so this extrapolation, due
> to not realizing the MCV list already contains all the matches, is pretty
> much responsible for the whole over-estimate.)
>

Agreed. I think the actual MCV stats I got included the first 6
entries, but yes, that's only around 7% of the data.
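
Just to spell out the arithmetic behind that (plain Python restating the
counts from the tables above; this is not anything the patch computes):

total_rows = 2254937

# multivariate MCV items matching "a=1 AND (b=1 OR b=2)": (1,1) and (1,2)
mcv_match_sel = (50000 + 40000) / total_rows   # ~0.0399

actual_sel = 90000 / total_rows                # ~0.0399

print(mcv_match_sel, actual_sel)

# The two are identical: the matching MCV items already contain every
# matching row, so extrapolating that match rate to the non-MCV part can
# only over-estimate (here by roughly 2x).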

> So, how would the proposed algorithm work? Let's start with "a=1":
>
> sel(a=1) = 0.1508
>
> I don't see much point in applying the two "b" clauses independently (nor
> do I see how that would be done, as it's effectively a single clause):
>
> sel(b=1 or b=2) = 0.0673
>
> And we get
>
> total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101
>
> From the multivariate MCV we get
>
> mcv_sel = 0.0399
>
> And finally
>
> total_sel = Max(total_sel, mcv_sel) = 0.0399
>
> Which is great, but I don't see how that actually helped anything. We
> still only have the estimate for the ~7% covered by the MCV list, and
> not for the remaining non-MCV part.
>

Right. If these are the only stats available, and there are just 2
top-level clauses like this, it either returns the MCV estimate, or
the old univariate estimate (whichever is larger). It avoids
over-estimating, but will almost certainly under-estimate when the MCV
matches are incomplete.
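
As a minimal sketch of the combination rule I'm describing (illustrative
Python only, with a made-up function name, not the patch's code):

def combine_selectivity(univariate_sel, mcv_match_sel):
    # The matching MCV items are known to exist, so their combined
    # frequency is a hard lower bound; nothing is assumed about the
    # non-MCV part of the data.
    return max(univariate_sel, mcv_match_sel)

# numbers from the example above
univariate = 0.1508 * 0.0673   # sel(a=1) * sel(b=1 or b=2), ~0.0101
mcv_match = 0.0399             # sum of the matching MCV items
print(combine_selectivity(univariate, mcv_match))   # 0.0399

That avoids the 2x over-estimate, but the ~93% of rows outside the MCV
list contribute nothing beyond the univariate product, hence the
under-estimate when the MCV matches are incomplete.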

> I could do the same thing for the second query, but the problem there is
> actually exactly the same - the extrapolation from the MCV part to the
> non-MCV part roughly doubles the estimate.
>
> So unless I'm applying your algorithm incorrectly, this does not seem
> like a very promising direction :-(
>

You could be right. Actually it's the order dependence with more than
2 top-level clauses that bothers me most about this algorithm. It's
also not entirely obvious how to include histogram stats in this
scheme.

A different approach that I have been thinking about is, in
mcv_update_match_bitmap(), to attempt to work out the probability of all
the clauses matching and it not being an MCV value. For example, given
a clause like a=1 whose univariate estimate we know (0.1508 in the
above example), and knowing that the MCV values account for 0.0222+0.0177
of that, the remainder must come from non-MCV values. So we could say
that the probability that a=1 holds and it is not an MCV value is
0.1508-0.0222-0.0177 = 0.1109. The question then is whether we could
combine these non-MCV probabilities to give an estimate of how many
non-MCV values we need to worry about. I've not fully thought that
through, but it might be useful. The problem is, it's still likely to
over-estimate when the MCV matches fully cover all possibilities, and
I still don't see a way to reliably detect that case.
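
A rough sketch of the per-clause arithmetic I have in mind (again, just
illustrative Python using the numbers from the example; combining the
per-clause remainders is exactly the part I haven't worked out):

sel_a1 = 0.1508                   # univariate sel(a=1)
mcv_a1_freqs = [0.0222, 0.0177]   # MCV items with a=1: (1,1) and (1,2)

# whatever part of sel(a=1) the MCV items don't explain must come from
# non-MCV values
non_mcv_a1 = sel_a1 - sum(mcv_a1_freqs)
print(non_mcv_a1)                 # ~0.1109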

Regards,
Dean
