Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-07-17 22:41:47
Message-ID: cbc9e160-c908-105e-cd54-6c0c69152b49@postgrespro.ru
Lists: pgsql-hackers

On 16.07.2018 23:55, Tomas Vondra wrote:
>
> On 07/16/2018 02:54 PM, Dean Rasheed wrote:
>> On 16 July 2018 at 13:23, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>>> The top-level clauses allow us to make such deductions, with deeper
>>>>> clauses it's much more difficult (perhaps impossible). Because for
>>>>> example with (a=1 AND b=1) there can be just a single match, so if we
>>>>> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
>>>>> b=2)) it's not that simple, because there may be multiple combinations
>>>>> and so a match in MCV does not guarantee anything.
>>>> Actually, it guarantees a lower bound on the overall selectivity, and
>>>> maybe that's the best that we can do in the absence of any other
>>>> stats.
>>>>
>>> Hmmm, is that actually true? Let's consider a simple example, with two
>>> columns, each with just 2 values, and a "perfect" MCV list:
>>>
>>> a | b | frequency
>>> -------------------
>>> 1 | 1 | 0.5
>>> 2 | 2 | 0.5
>>>
>>> And let's estimate sel(a=1 & b=2).
>> OK. In this case, there are no MCV matches, so there is no lower bound (it's 0).
>>
>> What we could do though is also impose an upper bound, based on the
>> sum of non-matching MCV frequencies. In this case, the upper bound is
>> also 0, so we could actually say the resulting selectivity is 0.
>>
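As a side note, the lower/upper bound idea can be sketched in a few lines of Python (purely illustrative, not the patch code; `mcv_bounds` is a made-up helper operating on the "perfect" two-column MCV list from the example):

```python
# Illustrative sketch (not PostgreSQL source): bounds on selectivity
# derived from an MCV list. This "perfect" MCV list covers 100% of rows.
mcv = {(1, 1): 0.5, (2, 2): 0.5}

def mcv_bounds(mcv, pred):
    """Lower bound = sum of matching MCV frequencies.
    Upper bound = 1 - sum of non-matching MCV frequencies."""
    matching = sum(f for item, f in mcv.items() if pred(*item))
    non_matching = sum(f for item, f in mcv.items() if not pred(*item))
    return matching, 1.0 - non_matching

# sel(a=1 AND b=2): nothing matches, so lower and upper bound are both 0
lo, hi = mcv_bounds(mcv, lambda a, b: a == 1 and b == 2)

# sel(a=1 AND b=1): the (1,1) item matches, pinning the selectivity at 0.5
lo11, hi11 = mcv_bounds(mcv, lambda a, b: a == 1 and b == 1)
```

When the MCV list covers less than 100% of the rows, the upper bound loosens accordingly, which is exactly the difficulty discussed below.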
> Hmmm, it's not very clear to me how we would decide which of these cases
> applies, because in most cases the MCV list does not cover 100% of the rows.
>
> Anyway, I've been thinking about how to modify the code to work the way
> you proposed (in a way sufficient for a PoC). But after struggling with
> it for a while it occurred to me it might be useful to do it on paper
> first, to verify how it would work on your examples.
>
> So let's use this data
>
> create table foo(a int, b int);
> insert into foo select 1,1 from generate_series(1,50000);
> insert into foo select 1,2 from generate_series(1,40000);
> insert into foo select 1,x/10 from generate_series(30,250000) g(x);
> insert into foo select 2,1 from generate_series(1,30000);
> insert into foo select 2,2 from generate_series(1,20000);
> insert into foo select 2,x/10 from generate_series(30,500000) g(x);
> insert into foo select 3,1 from generate_series(1,10000);
> insert into foo select 3,2 from generate_series(1,5000);
> insert into foo select 3,x from generate_series(3,600000) g(x);
> insert into foo select x,x/10 from generate_series(4,750000) g(x);
>
> Assuming we have perfectly exact statistics, we have these MCV lists
> (both univariate and multivariate):
>
> select a, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by a order by 2 desc;
>
> a | count | frequency
> --------+--------+-----------
> 3 | 614998 | 0.2727
> 2 | 549971 | 0.2439
> 1 | 339971 | 0.1508
> 1014 | 1 | 0.0000
> 57220 | 1 | 0.0000
> ...
>
> select b, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by b order by 2 desc;
>
> b | count | frequency
> --------+-------+-----------
> 1 | 90010 | 0.0399
> 2 | 65010 | 0.0288
> 3 | 31 | 0.0000
> 7 | 31 | 0.0000
> ...
>
> select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by a, b order by 3 desc;
>
> a | b | count | frequency
> --------+--------+-------+-----------
> 1 | 1 | 50000 | 0.0222
> 1 | 2 | 40000 | 0.0177
> 2 | 1 | 30000 | 0.0133
> 2 | 2 | 20000 | 0.0089
> 3 | 1 | 10000 | 0.0044
> 3 | 2 | 5000 | 0.0022
> 2 | 12445 | 10 | 0.0000
> ...
>
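(As a quick sanity check of the numbers above, the INSERT statements can be mirrored in Python; this is just a verification aid, not part of any patch.)

```python
# Sanity check: mirror the INSERT ... SELECT statements above and
# recompute the counts/frequencies shown in the three queries.
# generate_series(lo, hi) is inclusive on both ends, hence hi+1 here.
from collections import Counter

rows = Counter()
rows[(1, 1)] += 50000
rows[(1, 2)] += 40000
for x in range(30, 250001):
    rows[(1, x // 10)] += 1
rows[(2, 1)] += 30000
rows[(2, 2)] += 20000
for x in range(30, 500001):
    rows[(2, x // 10)] += 1
rows[(3, 1)] += 10000
rows[(3, 2)] += 5000
for x in range(3, 600001):
    rows[(3, x)] += 1
for x in range(4, 750001):
    rows[(x, x // 10)] += 1

total = sum(rows.values())   # matches the 2254937 divisor used above
a_counts = Counter()
b_counts = Counter()
for (a, b), n in rows.items():
    a_counts[a] += n
    b_counts[b] += n
```

This reproduces the table contents exactly: total = 2254937, a=1 has 339971 rows, b=1 has 90010, and (a,b)=(2,12445) has 10.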
> And let's estimate the two queries with complex clauses, where the
> multivariate stats gave 2x overestimates.
>
> SELECT * FROM foo WHERE a=1 and (b=1 or b=2);
> -- actual 90000, univariate: 24253, multivariate: 181091
>
> univariate:
>
> sel(a=1) = 0.1508
> sel(b=1) = 0.0399
> sel(b=2) = 0.0288
> sel(b=1 or b=2) = 0.0673
>
> multivariate:
> sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770)
>
> The second multivariate estimate comes from assuming only the first 6
> items make it to the multivariate MCV list (covering 6.87% of the data)
> and extrapolating the selectivity to the non-MCV data too.
>
> (Notice it's about 2x the actual selectivity, so this extrapolation due
> to not realizing the MCV already contains all the matches is pretty much
> responsible for the whole over-estimate).
>
> So, how would the proposed algorithm work? Let's start with "a=1":
>
> sel(a=1) = 0.1508
>
> I don't see much point in applying the two "b" clauses independently (or
> how would it be done, as it's effectively a single clause):
>
> sel(b=1 or b=2) = 0.0673
>
> And we get
>
> total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101
>
> From the multivariate MCV we get
>
> mcv_sel = 0.0399
>
> And finally
>
> total_sel = Max(total_sel, mcv_sel) = 0.0399
>
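(Spelled out as a tiny sketch, using only the frequencies quoted in this example:)

```python
# Illustrative sketch of the combination step above: multiply the
# univariate selectivities, then use the matching multivariate MCV
# items as a lower bound via Max().
sel_a = 0.1508             # sel(a=1)
sel_b = 0.0673             # sel(b=1 or b=2)
total_sel = sel_a * sel_b  # independence assumption, ~0.0101
mcv_sel = 0.0222 + 0.0177  # MCV items (1,1) and (1,2) -> 0.0399
total_sel = max(total_sel, mcv_sel)
```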
> Which is great, but I don't see how that actually helped anything? We
> still only have the estimate for the ~7% covered by the MCV list, and
> not the remaining non-MCV part.
>
> I could do the same thing for the second query, but the problem there is
> actually exactly the same - extrapolation from MCV to non-MCV part
> roughly doubles the estimate.
>
> So unless I'm applying your algorithm incorrectly, this does not seem
> like a very promising direction :-(
>
> There may be valuable information we could learn from the univariate
> estimates (using a Max() of them as an upper boundary seems reasonable),
> but that's still quite crude. And it will only ever work with simple
> top-level clauses. Once the clauses get more complicated, it seems
> rather tricky - presumably multivariate stats would be only used for
> correlated columns, so trying to deduce something from univariate
> estimates on complex clauses on such columns seems somewhat suspicious.
>
>
> regards
>

Teodor Sigaev has proposed an alternative approach for calculating the
selectivity of multicolumn joins and compound index searches. DBAs
usually create compound indexes, which the optimizer can use to build
efficient, index-based query execution plans. We can store statistics
for the compound keys of such indexes (as is done now for functional
indexes) and use them to estimate clause selectivity. I have
implemented this idea and will publish the patch in a separate thread
soon. For now I just want to share some results for Tomas's examples
(all plans below use a compound index, foo_a_b_idx on (a, b)).

For vanilla Postgres without extra statistics, the estimated number of
rows is about 4 times smaller than the real value:

postgres=# explain analyze SELECT count(*) FROM foo WHERE a=1 and (b=1 or b=2);
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10964.76..10964.77 rows=1 width=8) (actual time=49.251..49.251 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=513.60..10906.48 rows=23310 width=0) (actual time=17.368..39.928 rows=90000 loops=1)
         Recheck Cond: (((a = 1) AND (b = 1)) OR ((a = 1) AND (b = 2)))
         Heap Blocks: exact=399
         ->  BitmapOr  (cost=513.60..513.60 rows=23708 width=0) (actual time=17.264..17.264 rows=0 loops=1)
               ->  Bitmap Index Scan on foo_a_b_idx  (cost=0.00..295.41 rows=13898 width=0) (actual time=10.319..10.319 rows=50000 loops=1)
                     Index Cond: ((a = 1) AND (b = 1))
               ->  Bitmap Index Scan on foo_a_b_idx  (cost=0.00..206.53 rows=9810 width=0) (actual time=6.941..6.941 rows=40000 loops=1)
                     Index Cond: ((a = 1) AND (b = 2))

If we add extended statistics on columns a and b:

     create statistics ab on a,b from foo;
     analyze foo;

then the estimate is about a third larger than the real value (roughly
121k vs 90k):

postgres=# explain analyze SELECT count(*) FROM foo WHERE a=1 and (b=1 or b=2);
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=14447.11..14447.12 rows=1 width=8) (actual time=36.048..36.048 rows=1 loops=1)
   ->  Gather  (cost=14446.90..14447.11 rows=2 width=8) (actual time=35.982..36.037 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=13446.90..13446.91 rows=1 width=8) (actual time=30.172..30.172 rows=1 loops=3)
               ->  Parallel Bitmap Heap Scan on foo  (cost=2561.33..13424.24 rows=9063 width=0) (actual time=15.551..26.057 rows=30000 loops=3)
                     Recheck Cond: (((a = 1) AND (b = 1)) OR ((a = 1) AND (b = 2)))
                     Heap Blocks: exact=112
                     ->  BitmapOr  (cost=2561.33..2561.33 rows=121360 width=0) (actual time=20.304..20.304 rows=0 loops=1)
                           ->  Bitmap Index Scan on foo_a_b_idx  (cost=0.00..1488.46 rows=70803 width=0) (actual time=13.190..13.190 rows=50000 loops=1)
                                 Index Cond: ((a = 1) AND (b = 1))
                           ->  Bitmap Index Scan on foo_a_b_idx  (cost=0.00..1061.99 rows=50556 width=0) (actual time=7.110..7.110 rows=40000 loops=1)
                                 Index Cond: ((a = 1) AND (b = 2))

With compound-index statistics, the estimate is almost equal to the real value:

---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=13469.94..13469.95 rows=1 width=8) (actual time=70.710..70.710 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=1880.20..13411.66 rows=23310 width=0) (actual time=38.776..61.050 rows=90000 loops=1)
         Recheck Cond: (((a = 1) AND (b = 1)) OR ((a = 1) AND (b = 2)))
         Heap Blocks: exact=399
         ->  BitmapOr  (cost=1880.20..1880.20 rows=88769 width=0) (actual time=38.618..38.618 rows=0 loops=1)
               ->  Bitmap Index Scan on foo_a_b_idx  (cost=0.00..1030.50 rows=49007 width=0) (actual time=26.335..26.335 rows=50000 loops=1)
                     Index Cond: ((a = 1) AND (b = 1))
               ->  Bitmap Index Scan on foo_a_b_idx  (cost=0.00..838.05 rows=39762 width=0) (actual time=12.278..12.278 rows=40000 loops=1)
                     Index Cond: ((a = 1) AND (b = 2))
