Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
Cc: Alexey Ermakov <alexey(dot)ermakov(at)dataegret(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
Date: 2018-09-25 20:11:15
Message-ID: 18632.1537906275@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com> writes:
> On Tue, Apr 17, 2018 at 5:15 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> =?utf-8?q?PG_Bug_reporting_form?= <noreply(at)postgresql(dot)org> writes:
>>> I'm wondering how planner estimates number of rows in that case:

>> See eqjoinsel_semi, particularly the change in behavior when it thinks
>> nd2 is or is not a default estimate.

> There are similar issue without CTE which look pretty weird:

Yeah, this is exactly the same case as Alexey's: as soon as eqjoinsel_semi
decides it's dealing with a default ndistinct estimate, it chickens out
and delivers a very middle-of-the-road selectivity (0.5, it looks like).
It's somewhat luck that the non-default path is giving you an accurate
estimate, but certainly there's no surprise in the default case being
way off.

I don't particularly want to make that logic more aggressive about
assuming it's calculating something real. The existing behavior was
put in to fix a clear bug in the other direction, see
https://www.postgresql.org/message-id/flat/201104112029.14738.uwe%40oss4u.com

However, while looking at this I had a bit of an epiphany. The inner-join
selectivity in the same cases is pretty much on-target, so is there any
way we could factor that in? Yes, there is: the size of the semijoin
output could not be more than the output of a plain inner join of the
same two relations. So it'd be legitimate to clamp our selectivity
estimate for the semijoin case to make it not more than the inner-join
estimate.

A little bit of hacking later, I have the attached patch. The bulk of
the patch is just refactoring to avoid repetitive information lookup
when we call both eqjoinsel_semi and eqjoinsel_inner. The actual
change is just to clamp eqjoinsel_semi's result, like this:

/*
* We should never estimate the output of a semijoin to be more
* rows than the equivalent inner join; it's obviously impossible
* for that to happen. The former is N1 * Psemi while the latter
* is N1 * N2 * Pinner, so we may clamp Psemi <= N2 * Pinner.
* Doing this is worthwhile because of the shakier estimation
* rules we use in eqjoinsel_semi, particularly in cases where it
* has to punt entirely.
*/
selec = Min(selec, inner_rel->rows * selec_inner);

That makes the funny behavior go away in both test cases shown in this
thread. I find one plan change in the regression tests, but it looks
reasonable enough (and checking the actual row counts shows that the
estimate moved closer to reality, not further away).

Now, there's a certain amount of garbage-in-garbage-out to this: if for
some reason the innerjoin selectivity is way off, this could do more to
hurt the semijoin estimate than to help it. But I think that generally
the semijoin numbers are much less reliable than the innerjoin numbers,
so mostly it ought to be no-change or a win.

I'll queue this up for review in the next CF.

regards, tom lane

Attachment Content-Type Size
clamp-semijoin-selectivity-1.patch text/x-diff 21.8 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2018-09-26 00:09:29 Re: BUG #15238: Sequence owner not updated when owning table is foreign
Previous Message Dilip Kumar 2018-09-25 16:45:23 Re: BUG #15401: pg_dump error

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-09-25 20:17:23 Re: Query is over 2x slower with jit=on
Previous Message Andres Freund 2018-09-25 19:50:34 Re: Query is over 2x slower with jit=on