Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-07-17 23:58:06
Message-ID: abef0126-ff75-850f-0a25-063a8ffa07ad@2ndquadrant.com


On 07/18/2018 12:41 AM, Konstantin Knizhnik wrote:
> ...
>
> Teodor Sigaev has proposed an alternative approach for calculating
> selectivity of multicolumn joins or compound index searches.
> Usually the DBA creates compound indexes, which the optimizer can use
> to build an efficient query execution plan based on index search.
> We can store statistics for the compound keys of such indexes (as is
> done now for functional indexes) and use them to estimate the
> selectivity of clauses. I have implemented this idea and will publish
> the patch in a separate thread soon.
> Now I just want to share some results for Tomas' examples.
>
> So for vanilla Postgres without the extra statistics, the estimated
> number of rows is about 4 times smaller than the real one.
>

Can you please post the plans with parallelism disabled, and perhaps
without the aggregate? Both make reading the plans unnecessarily
difficult ...
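
For example, something along these lines should produce a serial plan
without the aggregate on top (max_parallel_workers_per_gather is the
regular GUC, foo/a/b taken from your example):

    SET max_parallel_workers_per_gather = 0;
    EXPLAIN ANALYZE
    SELECT * FROM foo WHERE a = 1 AND (b = 1 OR b = 2);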

>  postgres=# explain analyze SELECT count(*) FROM foo WHERE a=1 and (b=1 or b=2);
>                                   QUERY PLAN
> --------------------------------------------------------------------------------
>  Aggregate  (cost=10964.76..10964.77 rows=1 width=8) (actual time=49.251..49.251 rows=1 loops=1)
>    ->  Bitmap Heap Scan on foo  (cost=513.60..10906.48 rows=23310 width=0) (actual time=17.368..39.928 rows=90000 loops=1)

ok, 23k vs. 90k

>
> If we add statistics for the a and b columns:
>
>      create statistics ab on a,b from foo;
>      analyze foo;
>
> then the expected result is about 30% larger than the real one: 120k vs 90k:
>

Eh? The plan however says 9k vs. 30k (and those are per-worker numbers;
with loops=3 that's roughly 27k vs. 90k in total) ...

>                ->  Parallel Bitmap Heap Scan on foo  (cost=2561.33..13424.24 rows=9063 width=0) (actual time=15.551..26.057 rows=30000 loops=3)

...

> With compound index statistics the estimate is almost equal to the real value:
>
>    ->  Bitmap Heap Scan on foo  (cost=1880.20..13411.66 rows=23310 width=0) (actual time=38.776..61.050 rows=90000 loops=1)

Well, this says 23k vs. 90k.
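
FWIW, if you want to verify what the extended statistics object actually
contains, you can peek at the catalog directly. A minimal sketch,
assuming the statistics object is the "ab" one from your example:

    -- which statistics kinds were built for the "ab" object
    SELECT stxname, stxkind
    FROM pg_statistic_ext
    WHERE stxname = 'ab';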

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
