Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-bugs(at)postgresql(dot)org, casey(dot)shobe(at)messagesystems(dot)com
Subject: Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct
Date: 2012-01-12 01:24:44
Message-ID: 1378.1326331484@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Andres Freund <andres(at)anarazel(dot)de> writes:
> On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
>> Looks pretty bogus to me. You're essentially assuming that the side of
>> the join without statistics is unique, which is a mighty dubious
>> assumption.

> It sure is a bit dubious. But assuming that a semijoin that has max of n rows
> on the inner side results in half of the outer sides rows (>> n) is pretty
> bogus as well.

How so? There is no reason to think that the number of LHS rows with a
match is limited by the number of RHS rows. If we knew that the LHS
join key was unique, then yes that'd be sensible, but we don't know
that.

> SELECT * FROM blub WHERE foo IN (SELECT something_with_aggregation);
> is not exactly a fringe case, so I find it problematic regressing
> quite a bit in the estimates.

Agreed, and that's why I don't want to put in a patch that carries the
risk of regressing even more. I'm happy to do something that's got some
amount of theory behind it, but if we're just guessing, we can't afford
to guess a very high or low selectivity.

One thing I've considered but not done anything about is that in a lot
of practical cases for this, the aggregation or grouping properties of
the sub-select would provide adequate reason for assuming its output is
more or less unique, so that taking ndistinct equal to number of rows
actually is sane. But it would need a bit of thought about what
properties we want to treat as justifying such an assumption, and then
some code to see if the join key is a Var coming out of such a
sub-select. (Actually, what such a patch would probably look like is
modifying examine_simple_variable to not just punt when it finds the
Var came from an aggregating subquery.)

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andres Freund 2012-01-12 02:00:03 Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct
Previous Message Andres Freund 2012-01-12 00:53:02 Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct