Re: multivariate statistics v14

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics v14
Date: 2016-03-22 05:53:14
Message-ID: CAMkU=1zBTiYwDc8K=YmL1dmiZqXHPNbeX=svBGJCNTjaGtsiyQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Mar 20, 2016 at 4:34 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>
> D'oh. Thanks for reporting. Attached is v16, hopefully fixing the few
> remaining whitespace issues.

Hi Tomas,

I'm trying out v16 against a common problem, where postgresql thinks
it is likely top stop early during a "order by (index express) limit
1" but it doesn't actually stop early due to cross-column
correlations. But the multivariate statistics don't seem to help. Am
I doing this wrong, or just expecting too much?

jjanes=# create table foo as select x, floor(x/(10000000/500))::int as
y from generate_series(1,10000000) f(x);
jjanes=# create index on foo (x,y);
jjanes=# create index on foo (y,x);
jjanes=# create statistics jjj on foo (x,y) with (dependencies,histogram);
jjanes=# vacuum analyze ;

jjanes=# explain (analyze, timing off) select x from foo where y
between 478 and 480 order by x limit 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..4.92 rows=1 width=4) (actual rows=1 loops=1)
-> Index Only Scan using foo_x_y_idx on foo (cost=0.43..210156.55
rows=46812 width=4) (actual rows=1 loops=1)
Index Cond: ((y >= 478) AND (y <= 480))
Heap Fetches: 0
Planning time: 0.311 ms
Execution time: 478.917 ms

Here is walks up the index on x, until it meets the first row meeting
the qualification on y. It thinks it will get to stop early and be
very fast, but it doesn't.

If I add an dummy addition to the ORDER BY, to force it not to talk
the index, I get a plan which uses the other index and is actually
much faster, but is planned to be several hundred times slower:

jjanes=# explain (analyze, timing off) select x from foo where y
between 478 and 480 order by x+0 limit 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Limit (cost=1803.77..1803.77 rows=1 width=8) (actual rows=1 loops=1)
-> Sort (cost=1803.77..1920.80 rows=46812 width=8) (actual rows=1 loops=1)
Sort Key: ((x + 0))
Sort Method: top-N heapsort Memory: 25kB
-> Index Only Scan using foo_y_x_idx on foo
(cost=0.43..1569.70 rows=46812 width=8) (actual rows=60000 loops=1)
Index Cond: ((y >= 478) AND (y <= 480))
Heap Fetches: 0
Planning time: 0.175 ms
Execution time: 20.264 ms

(I use the "timing off" option, because without it the second plan
spends most of its time calling "gettimeofday")

Cheers,

Jeff

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2016-03-22 05:54:33 Re: Odd system-column handling in postgres_fdw join pushdown patch
Previous Message James Sewell 2016-03-22 05:47:25 Re: Parallel Aggregate