Weirdly pessimistic estimates in optimizer

From: David Kubečka <davidkubecka(at)seznam(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Weirdly pessimistic estimates in optimizer
Date: 2015-03-02 19:13:51
Message-ID: CAGLMwk-C5y8yFQDY7XmPEcKAx6C1r4AOcc1zLKakmGCHwYmhZQ@mail.gmail.com
Lists: pgsql-hackers

Hi all,

I have encountered a performance problem with a relatively simple query,
which I think is caused by overly pessimistic estimates in the optimizer. I
originally ran into this issue on a table a few GB in size, but it can be
reproduced with a much smaller table as follows:

-- Setup main fact table
create table facts (fk int, f numeric);
insert into facts
    select (random()*10000)::int, random()*10000
    from generate_series(1,1000000) g(i);
create index on facts (fk);
vacuum analyze facts;

-- Pick 100 values of 'fk' which cover roughly 1/100 of the rows in the
-- 'facts' table and select all corresponding values
create table random_fk_dupl as
    select fk from facts
    where fk in (select (random()*10000)::int from generate_series(1,100) g(i));
vacuum analyze random_fk_dupl;

-- Create a table with unique values from 'random_fk_dupl'
create table random_fk_uniq as select distinct fk from random_fk_dupl;
vacuum analyze random_fk_uniq;
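
As a quick sanity check of what the planner has to work with (the exact
numbers will of course vary with the random data), the row counts and
ndistinct statistics for the 'fk' columns can be inspected like this:

-- Planner statistics for the 'fk' columns; numbers are data-dependent
select c.relname, c.reltuples, s.n_distinct
from pg_class c
join pg_stats s on s.tablename = c.relname and s.attname = 'fk'
where c.relname in ('facts', 'random_fk_dupl', 'random_fk_uniq');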

(The motivation for doing this is that I have a persistent table/cache with
aggregated values, and I want to update this table whenever new data are
loaded into 'facts' by processing only the newly loaded data. This is a very
rough description, but more detail isn't needed for the sake of the
argument.)

Now the main query:

david=# explain analyze select fk, sum(f) from facts where fk in (select fk from random_fk_dupl) group by 1;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=892.82..1017.34 rows=9962 width=15) (actual time=22.752..22.791 rows=100 loops=1)
   Group Key: facts.fk
   ->  Nested Loop  (cost=167.01..842.63 rows=10038 width=15) (actual time=2.167..16.739 rows=9807 loops=1)
         ->  HashAggregate  (cost=166.59..167.59 rows=100 width=4) (actual time=2.153..2.193 rows=100 loops=1)
               Group Key: random_fk_dupl.fk
               ->  Seq Scan on random_fk_dupl  (cost=0.00..142.07 rows=9807 width=4) (actual time=0.003..0.688 rows=9807 loops=1)
         ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75 rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)
               Index Cond: (fk = random_fk_dupl.fk)
 Planning time: 0.372 ms
 Execution time: 22.842 ms

This is the plan I would expect given the estimates (which are quite good,
except for the final aggregation), and it corresponds exactly to what's
written in the optimizer README:

{quote}
The naive way to join two relations using a clause like WHERE A.X = B.Y
is to generate a nestloop plan like this:

    NestLoop
        Filter: A.X = B.Y
        -> Seq Scan on A
        -> Seq Scan on B

We can make this better by using a merge or hash join, but it still
requires scanning all of both input relations. If A is very small and B is
very large, but there is an index on B.Y, it can be enormously better to do
something like this:

    NestLoop
        -> Seq Scan on A
        -> Index Scan using B_Y_IDX on B
               Index Condition: B.Y = A.X

Here, we are expecting that for each row scanned from A, the nestloop
plan node will pass down the current value of A.X into the scan of B.
That allows the indexscan to treat A.X as a constant for any one
invocation, and thereby use it as an index key. This is the only plan type
that can avoid fetching all of B, and for small numbers of rows coming from
A, that will dominate every other consideration. (As A gets larger, this
gets less attractive, and eventually a merge or hash join will win instead.
So we have to cost out all the alternatives to decide what to do.)
{quote}

So far so good. Now let's use 'random_fk_uniq' instead of 'random_fk_dupl'.
I would expect almost the same plan (with just a cheaper inner
HashAggregate), because the optimizer knows that there are (at most) 100
values in 'random_fk_uniq', so a nested loop is clearly the best choice.
Instead we get this:

david=# explain analyze select fk, sum(f) from facts where fk in (select fk from random_fk_uniq) group by 1;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=18198.11..18322.64 rows=9962 width=15) (actual time=163.738..163.773 rows=100 loops=1)
   Group Key: facts.fk
   ->  Hash Semi Join  (cost=3.25..18147.92 rows=10038 width=15) (actual time=0.298..160.271 rows=9807 loops=1)
         Hash Cond: (facts.fk = random_fk_uniq.fk)
         ->  Seq Scan on facts  (cost=0.00..15408.00 rows=1000000 width=15) (actual time=0.010..66.524 rows=1000000 loops=1)
         ->  Hash  (cost=2.00..2.00 rows=100 width=4) (actual time=0.079..0.079 rows=100 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 4kB
               ->  Seq Scan on random_fk_uniq  (cost=0.00..2.00 rows=100 width=4) (actual time=0.003..0.028 rows=100 loops=1)
 Planning time: 1.021 ms
 Execution time: 163.911 ms

Even with our small data set this is significantly slower, and the relative
difference gets much bigger with larger data. Why does the optimizer choose
such an ineffective plan? Because it apparently thinks it's the cheapest one.
So let's force it to use a nested loop in order to see the cost it assigns to
that plan.

david=# set enable_hashjoin to off;
SET
david=# explain analyze select fk, sum(f) from facts where fk in (select fk from random_fk_uniq) group by 1;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=20063.48..20188.01 rows=9962 width=15) (actual time=25.502..25.545 rows=100 loops=1)
   Group Key: facts.fk
   ->  Nested Loop  (cost=5.16..20013.29 rows=10038 width=15) (actual time=0.159..19.748 rows=9807 loops=1)
         ->  HashAggregate  (cost=2.25..3.25 rows=100 width=4) (actual time=0.032..0.068 rows=100 loops=1)
               Group Key: random_fk_uniq.fk
               ->  Seq Scan on random_fk_uniq  (cost=0.00..2.00 rows=100 width=4) (actual time=0.004..0.011 rows=100 loops=1)
         ->  Bitmap Heap Scan on facts  (cost=5.16..199.11 rows=100 width=15) (actual time=0.042..0.168 rows=98 loops=100)
               Recheck Cond: (fk = random_fk_uniq.fk)
               Heap Blocks: exact=9745
               ->  Bitmap Index Scan on facts_fk_idx  (cost=0.00..5.13 rows=100 width=0) (actual time=0.024..0.024 rows=98 loops=100)
                     Index Cond: (fk = random_fk_uniq.fk)
 Planning time: 0.702 ms
 Execution time: 25.660 ms

OK, so the overall cost of the nested loop is 20013, whereas for the hash
join it's only 18147. Let's finally force exactly the same plan as for the
query using 'random_fk_dupl'.

david=# set enable_bitmapscan to off;
SET
david=# explain analyze select fk, sum(f) from facts where fk in (select fk from random_fk_uniq) group by 1;
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=21578.55..21703.07 rows=9962 width=15) (actual time=17.846..17.893 rows=100 loops=1)
   Group Key: facts.fk
   ->  Nested Loop  (cost=0.42..21528.36 rows=10038 width=15) (actual time=0.014..13.069 rows=9807 loops=1)
         ->  HashAggregate  (cost=2.25..3.25 rows=100 width=4) (actual time=0.032..0.068 rows=100 loops=1)
               Group Key: random_fk_uniq.fk
               ->  Seq Scan on random_fk_uniq  (cost=0.00..2.00 rows=100 width=4) (actual time=0.004..0.011 rows=100 loops=1)
         ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26 rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)
               Index Cond: (fk = random_fk_uniq.fk)
 Planning time: 0.257 ms
 Execution time: 17.943 ms

The question is why the optimizer, or rather the cost estimator, produces
such different estimates for such a small change in input. Moreover, it seems
that the main culprit of the bad estimates isn't directly related to the
outer table, but rather to the inner table. Just compare the estimates for
the two index scans:

With 'random_fk_dupl':
   ->  Index Scan using facts_fk_idx on facts  (cost=0.42..5.75 rows=100 width=15) (actual time=0.009..0.117 rows=98 loops=100)

With 'random_fk_uniq':
   ->  Index Scan using facts_fk_idx on facts  (cost=0.42..214.26 rows=100 width=15) (actual time=0.007..0.109 rows=98 loops=100)
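
Both outer relations feed the same set of roughly 100 distinct 'fk' values
into the parameterized index scan, which is easy to double-check (again, the
exact counts depend on the random data):

select count(distinct fk) from random_fk_dupl;  -- ~100
select count(distinct fk) from random_fk_uniq;  -- the same ~100 values, by construction

So the per-loop work on the inner side is identical in both cases; only the
estimated cost of that inner index scan differs.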

I have read the optimizer README file and also looked briefly at the code,
but this doesn't seem to be related to the particular implementation of any
one algorithm (e.g. nested loop). Perhaps it's the way cost estimates are
propagated down (or sideways? that would be weird...) the query tree. But I
am really not sure, since this is my first time looking at the optimizer code
base. I should also add that I have reproduced this behaviour on all
versions of Pg from 9.2 up to current devel.
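
For anyone trying to reproduce this, it may also be worth confirming that no
non-default planner cost settings are skewing the comparison; something like
the following shows the relevant GUCs:

select name, setting
from pg_settings
where name in ('seq_page_cost', 'random_page_cost', 'cpu_tuple_cost',
               'cpu_index_tuple_cost', 'cpu_operator_cost', 'effective_cache_size');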

Regards,

David
