Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE

From: Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexey Ermakov <alexey(dot)ermakov(at)dataegret(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
Date: 2018-09-25 11:24:39
Message-ID: CAK-MWwR_G0NLLB9TwRhs+3cHDa9zZnYg7AgpAfeYJn_hgpZBsQ@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Tue, Apr 17, 2018 at 5:15 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> > I'm wondering how planner estimates number of rows in that case:
>
> See eqjoinsel_semi, particularly the change in behavior when it thinks
> nd2 is or is not a default estimate.
>
> Given the lack of statistics about the output of the WITH clause,
> it's hard to see how we'd ever get trustworthy estimates here.
> I think the fact that your first example yields an accurate
> estimate is mostly luck.
>
> regards, tom lane
>
>

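If I understand the mechanism correctly, the 200 boundary in the example below is presumably the planner's built-in default ndistinct (DEFAULT_NUM_DISTINCT = 200 in selfuncs.h): a real column has an n_distinct entry in pg_stats, while the output of a WITH clause or subselect has no statistics at all, so eqjoinsel_semi can only fall back to that default guess for nd2. A quick way to see what the planner does and does not have to work with (using the resume table from the example below):

-- the base table's join column has real statistics...
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename = 'resume' AND attname = 'resume_id';
-- ...but there is no pg_stats entry at all for the output of
-- "select id from generate_series(1, 1000) gs(id) LIMIT 200".
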
There is a similar issue without a CTE which looks pretty weird:

Good case with LIMIT 199 and adequate estimation:
hh=# explain SELECT * FROM resume WHERE resume_id IN (select id from
generate_series(1, 1000) gs(id) LIMIT 199);
QUERY PLAN
---------------------------------------------------------------------------------------------------
 Nested Loop  (cost=21.53..108.98 rows=199 width=519)
   ->  Unique  (cost=21.42..21.62 rows=199 width=4)
         ->  Sort  (cost=21.42..21.52 rows=199 width=4)
               Sort Key: gs.id
               ->  Limit  (cost=0.00..9.95 rows=199 width=4)
                     ->  Function Scan on generate_series gs  (cost=0.00..50.00 rows=1000 width=4)
   ->  Index Scan using resume_pk on resume  (cost=0.11..0.39 rows=1 width=519)
         Index Cond: (resume_id = gs.id)

Very bad case with awful estimation (the only difference is LIMIT 200 vs LIMIT 199):
explain SELECT * FROM resume WHERE resume_id IN (select id from
generate_series(1, 1000) gs(id) LIMIT 200);
QUERY PLAN
---------------------------------------------------------------------------------------------------
 Nested Loop  (cost=21.64..109.53 rows=45860504 width=519)
   ->  Unique  (cost=21.53..21.73 rows=200 width=4)
         ->  Sort  (cost=21.53..21.63 rows=200 width=4)
               Sort Key: gs.id
               ->  Limit  (cost=0.00..10.00 rows=200 width=4)
                     ->  Function Scan on generate_series gs  (cost=0.00..50.00 rows=1000 width=4)
   ->  Index Scan using resume_pk on resume  (cost=0.11..0.39 rows=1 width=519)
         Index Cond: (resume_id = gs.id)
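
(In case someone wants to double-check: EXPLAIN ANALYZE on the same query shows estimated vs. actual rows side by side; the Nested Loop here is estimated at tens of millions of rows while no more than 200 rows can actually come out of the IN list, since resume_id is the primary key. Just a sketch, exact numbers obviously depend on the resume table.)

EXPLAIN ANALYZE
SELECT * FROM resume
WHERE resume_id IN (select id from generate_series(1, 1000) gs(id) LIMIT 200);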

It's not a problem by itself, but once you start using this query with more
joined tables, a lot of bad things happen because of the 5-orders-of-magnitude
error in selectivity estimation.

PS: in practice it forces us to use a LIMIT of no more than 199 in complex joins
for batch operations, or the database starts generating funny plans.
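
One possible alternative to the magic 199 limit (only a sketch, not something from our production code; the batch_ids name is just for illustration) is to materialize the id list into a temp table and ANALYZE it, so the planner gets a real ndistinct instead of the default guess:

-- ids from generate_series are distinct here, so a plain join is
-- equivalent to the IN(); otherwise dedup the temp table first
CREATE TEMP TABLE batch_ids AS
  select id from generate_series(1, 1000) gs(id) LIMIT 200;
ANALYZE batch_ids;

EXPLAIN
SELECT r.* FROM resume r JOIN batch_ids b ON b.id = r.resume_id;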

Regards,
Maxim

--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Doctor, you advised me not to do this, but why does it still hurt
when I do it again?"
