BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: alexey(dot)ermakov(at)dataegret(dot)com
Subject: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
Date: 2018-04-17 09:40:50
Message-ID: 152395805004.19366.3107109716821067806@wrigleys.postgresql.org
Lists: pgsql-bugs pgsql-hackers

The following bug has been logged on the website:

Bug reference: 15160
Logged by: Alexey Ermakov
Email address: alexey(dot)ermakov(at)dataegret(dot)com
PostgreSQL version: 10.3
Operating system: Linux
Description:

Hello,

I'm wondering how planner estimates number of rows in that case:

create table test_in (id int primary key);
insert into test_in select id from generate_series(1,1000000) gs(id);
analyze test_in;

explain analyze with ids as (select id from generate_series(1,1000) gs(id)
limit 200)
select * from test_in where id in (select id from ids);
-------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=6.93..139.79 rows=200 width=4) (actual time=0.129..0.400 rows=200 loops=1)
  CTE ids
    ->  Limit  (cost=0.00..2.00 rows=200 width=4) (actual time=0.050..0.066 rows=200 loops=1)
          ->  Function Scan on generate_series gs  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.050..0.057 rows=200 loops=1)
  ->  HashAggregate  (cost=4.50..6.50 rows=200 width=4) (actual time=0.117..0.133 rows=200 loops=1)
        Group Key: ids.id
        ->  CTE Scan on ids  (cost=0.00..4.00 rows=200 width=4) (actual time=0.051..0.086 rows=200 loops=1)
  ->  Index Only Scan using test_in_pkey on test_in  (cost=0.42..0.66 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=200)
        Index Cond: (id = ids.id)
        Heap Fetches: 200
Planning time: 0.128 ms
Execution time: 0.434 ms

explain analyze with ids as (select id from generate_series(1,1000) gs(id)
limit 201)
select * from test_in where id in (select id from ids);
-------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=6.96..132.78 rows=500000 width=4) (actual time=0.119..0.389 rows=201 loops=1)
  CTE ids
    ->  Limit  (cost=0.00..2.01 rows=201 width=4) (actual time=0.048..0.064 rows=201 loops=1)
          ->  Function Scan on generate_series gs  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.048..0.056 rows=201 loops=1)
  ->  HashAggregate  (cost=4.52..6.52 rows=200 width=4) (actual time=0.113..0.130 rows=201 loops=1)
        Group Key: ids.id
        ->  CTE Scan on ids  (cost=0.00..4.02 rows=201 width=4) (actual time=0.049..0.083 rows=201 loops=1)
  ->  Index Only Scan using test_in_pkey on test_in  (cost=0.42..0.66 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=201)
        Index Cond: (id = ids.id)
        Heap Fetches: 201
Planning time: 0.068 ms
Execution time: 0.417 ms

Please note that in the first example we get a correct estimate of the total
number of rows (200), but in the second one, where the CTE returned 201 rows
instead of 200, the total number of rows is estimated as 500000 (half of the
table test_in). That is way off and could lead to a non-optimal plan and poor
performance.
I get the same estimate if I replace the IN clause with an equivalent EXISTS
subquery; a plain join estimates the number of rows fine (but it is not
equivalent in the general case, when test_in.id is not unique).
Reproduced in 9.5, 9.6 and 10. Interestingly, in PostgreSQL 10 the threshold
is 200 rows, while in earlier versions it is 199.
I suspect the selectivity of 0.5 somehow arises inside the
compute_semi_anti_join_factors function in costsize.c, but I'm not sure.
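For completeness, the EXISTS form I tried (which, in my tests, produces the
same rows=500000 estimate once the CTE yields more than 200 rows) is a
straightforward rewrite of the IN query above:

```sql
-- Equivalent EXISTS rewrite of the second example; in my tests the
-- planner shows the same rows=500000 estimate for the join.
explain analyze
with ids as (select id from generate_series(1,1000) gs(id) limit 201)
select *
from test_in t
where exists (select 1 from ids where ids.id = t.id);
```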

Thanks,
Alexey Ermakov
