Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, casey(dot)shobe(at)messagesystems(dot)com
Subject: Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct
Date: 2012-01-12 02:00:03
Message-ID: 201201120300.03698.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Thursday, January 12, 2012 02:24:44 AM Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
> >> Looks pretty bogus to me. You're essentially assuming that the side of
> >> the join without statistics is unique, which is a mighty dubious
> >> assumption.
> >
> > It sure is a bit dubious. But assuming that a semijoin that has max of n
> > rows on the inner side results in half of the outer sides rows (>> n) is
> > pretty bogus as well.
>
> How so? There is no reason to think that the number of LHS rows with a
> match is limited by the number of RHS rows. If we knew that the LHS
> join key was unique, then yes that'd be sensible, but we don't know
> that.
In the current example we have an estimate for the distinctness of the LHS. I
don't see how guesstimating the RHS number of tuples in a semijoin to
vardata2->rel->rows will be worse than just assuming a selectivity of 0.5 for
the whole thing. Possibly it would make sense to additionally clamp the
selectivity to an upper limit of 0.5 in that case.
I guess what I am aiming at is that overestimating the number of distinct
tuples in the *RHS* won't lead to underestimating the number of result tuples.
By clamping the selectivity to 0.5 in that case we can be sure not to
overestimate more than currently.

> > SELECT * FROM blub WHERE foo IN (SELECT something_with_aggregation);
> > is not exactly a fringe case, so I find it problematic regressing
> > quite a bit in the estimates.
>
> Agreed, and that's why I don't want to put in a patch that carries the
> risk of regressing even more. I'm happy to do something that's got some
> amount of theory behind it, but if we're just guessing, we can't afford
> to guess a very high or low selectivity.
Not sure how more an estimation can regress than:
Nested Loop (cost=0.00..1859.02 rows=2550000 width=11) (actual
time=0.182..11.236 rows=199 loops=1)

;)

> One thing I've considered but not done anything about is that in a lot
> of practical cases for this, the aggregation or grouping properties of
> the sub-select would provide adequate reason for assuming its output is
> more or less unique, so that taking ndistinct equal to number of rows
> actually is sane. But it would need a bit of thought about what
> properties we want to treat as justifying such an assumption, and then
> some code to see if the join key is a Var coming out of such a
> sub-select. (Actually, what such a patch would probably look like is
> modifying examine_simple_variable to not just punt when it finds the
> Var came from an aggregating subquery.)
Yes, having that would be great, but be a bit more invasive than I like to
think right now. This thing is actually a problem for people in the field...

Andres-

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Maxim Boguk 2012-01-12 04:04:37 Re: BUG #6393: cluster sometime fail under heavy concurrent write load
Previous Message Tom Lane 2012-01-12 01:24:44 Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct