
Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, casey(dot)shobe(at)messagesystems(dot)com
Subject: Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct
Date: 2012-01-12 02:00:03
Lists: pgsql-bugs
On Thursday, January 12, 2012 02:24:44 AM Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
> >> Looks pretty bogus to me.  You're essentially assuming that the side of
> >> the join without statistics is unique, which is a mighty dubious
> >> assumption.
> > 
> > It sure is a bit dubious. But assuming that a semijoin that has max of n
> > rows on the inner side results in half of the outer side's rows (>> n) is
> > pretty bogus as well.
> How so?  There is no reason to think that the number of LHS rows with a
> match is limited by the number of RHS rows.  If we knew that the LHS
> join key was unique, then yes that'd be sensible, but we don't know
> that.
In the current example we have an estimate for the distinctness of the LHS. I 
don't see how guesstimating the RHS number of tuples in a semijoin as 
vardata2->rel->rows would be worse than just assuming a selectivity of 0.5 for 
the whole thing. Possibly it would make sense to additionally clamp the 
selectivity to an upper limit of 0.5 in that case.
I guess what I am aiming at is that overestimating the number of distinct 
tuples in the *RHS* won't lead to underestimating the number of result tuples. 
By clamping the selectivity to 0.5 in that case we can be sure not to 
overestimate more than currently.
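To make the argument concrete, here is a minimal Python sketch of the fallback being discussed. This is a hypothetical illustration, not the actual eqjoinsel_semi C code; the function name and parameters are invented for the example. It assumes the standard semi-join heuristic (selectivity roughly nd2/nd1 when the LHS has more distinct values than the RHS), substitutes the RHS row estimate when ndistinct is unreliable, and caps the result at the current flat default of 0.5 so the guess can never overestimate more than today's behavior:

```python
def eqjoinsel_semi_sketch(nd1, nd2, rhs_rows, nullfrac1=0.0):
    """Hypothetical sketch of the proposed semi-join selectivity fallback.

    nd1:      estimated number of distinct LHS join-key values (assumed known)
    nd2:      estimated number of distinct RHS values, or None if unreliable
    rhs_rows: planner row-count estimate for the RHS relation
    """
    if nd2 is None:
        # Proposal: treat every RHS row as potentially distinct ...
        nd2 = rhs_rows
        # ... but clamp at the current flat default of 0.5, so this
        # guess cannot overestimate more than the status quo does.
        cap = 0.5
    else:
        cap = 1.0
    if nd1 <= nd2:
        # Every distinct LHS value can find a match.
        selec = 1.0 - nullfrac1
    else:
        # Only a nd2/nd1 fraction of LHS values can have a partner.
        selec = (1.0 - nullfrac1) * (nd2 / nd1)
    return min(selec, cap)
```

With 10000 distinct LHS values and a 200-row RHS of unknown distinctness, this yields 0.02 instead of the flat 0.5, while a small nd1 still hits the 0.5 cap.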

> > SELECT * FROM blub WHERE foo IN (SELECT something_with_aggregation);
> > is not exactly a fringe case, so I find it problematic to regress
> > quite a bit in the estimates.
> Agreed, and that's why I don't want to put in a patch that carries the
> risk of regressing even more.  I'm happy to do something that's got some
> amount of theory behind it, but if we're just guessing, we can't afford
> to guess a very high or low selectivity.
Not sure how an estimate can regress much more than this:
 Nested Loop  (cost=0.00..1859.02 rows=2550000 width=11) (actual time=0.182..11.236 rows=199 loops=1)


> One thing I've considered but not done anything about is that in a lot
> of practical cases for this, the aggregation or grouping properties of
> the sub-select would provide adequate reason for assuming its output is
> more or less unique, so that taking ndistinct equal to number of rows
> actually is sane.  But it would need a bit of thought about what
> properties we want to treat as justifying such an assumption, and then
> some code to see if the join key is a Var coming out of such a
> sub-select.  (Actually, what such a patch would probably look like is
> modifying examine_simple_variable to not just punt when it finds the
> Var came from an aggregating subquery.)
Yes, having that would be great, but it would be a bit more invasive than I 
would like right now. This thing is actually a problem for people in the field...


