## Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct

From: Tom Lane Andres Freund pgsql-bugs(at)postgresql(dot)org, casey(dot)shobe(at)messagesystems(dot)com Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct 2012-02-16 00:24:59 26313.1329351899@sss.pgh.pa.us (view raw or whole thread) 2012-01-11 20:10:40 from Andres Freund  2012-01-12 00:01:01 from Tom Lane   2012-01-12 00:40:34 from Andres Freund    2012-01-12 01:24:44 from Tom Lane     2012-01-12 02:00:03 from Andres Freund      2012-02-16 00:24:59 from Tom Lane    2012-01-12 18:03:25 from Casey Allen Shobe   2012-01-12 00:53:02 from Andres Freund pgsql-bugs
```[ getting back to this after assorted distractions ]

Andres Freund <andres(at)anarazel(dot)de> writes:
> On Thursday, January 12, 2012 02:24:44 AM Tom Lane wrote:
>> Andres Freund <andres(at)anarazel(dot)de> writes:
>>> On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
>>>> Looks pretty bogus to me.  You're essentially assuming that the side of
>>>> the join without statistics is unique, which is a mighty dubious
>>>> assumption.

>>> It sure is a bit dubious. But assuming that a semijoin that has max of n
>>> rows on the inner side results in half of the outer sides rows (>> n) is
>>> pretty bogus as well.

> In the current example we have an estimate for the distinctness of the LHS. I
> don't see how guesstimating the RHS number of tuples in a semijoin to
> vardata2->rel->rows will be worse than just assuming a selectivity of 0.5 for
> the whole thing.

The real problem is that the estimate we want to use involves the ratio
nd2/nd1.  If either of those numbers is mere fantasy, so is the ratio.
It doesn't really help to say "well, we can upper-bound this number
here", because sometimes a too large result is as bad as too small.

>> One thing I've considered but not done anything about is that in a lot
>> of practical cases for this, the aggregation or grouping properties of
>> the sub-select would provide adequate reason for assuming its output is
>> more or less unique, so that taking ndistinct equal to number of rows
>> actually is sane.  But it would need a bit of thought about what
>> properties we want to treat as justifying such an assumption, and then
>> some code to see if the join key is a Var coming out of such a
>> sub-select.  (Actually, what such a patch would probably look like is
>> modifying examine_simple_variable to not just punt when it finds the
>> Var came from an aggregating subquery.)

> Yes, having that would be great, but be a bit more invasive than I like to
> think right now. This thing is actually a problem for people in the field...

I'm not sure it's that bad.  Attached is a simple patch to notice that
the subquery is "SELECT DISTINCT foo" and if so assume the result is
unique.  This takes care of at least the first of your original
examples.  I'm not sure whether fixing this single case is enough to get
excited about, but it's a step forward anyway.

Your second example, involving a WITH, would properly be handled by
teaching examine_simple_variable to drill down into CTEs.  I lacked the
round tuits to make it do so initially, and still do, but if you'd like
to have a go at it ...

regards, tom lane

```

Attachment: notice-distinct-subquery.patch
Description: text/x-patch (2.6 KB)

### pgsql-bugs by date

 Next: From: zoulx1982 Date: 2012-02-17 05:25:43 Subject: BUG #6460: routine my_log2 use incorrect data type ? Previous: From: yannick godbout Date: 2012-02-15 23:52:06 Subject: probleme d'installation