Re: 200 = 199 + 1?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Tiikkaja <marko(at)joh(dot)to>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 200 = 199 + 1?
Date: 2017-09-28 18:21:12
Message-ID: 4253.1506622872@sss.pgh.pa.us
Lists: pgsql-hackers

Marko Tiikkaja <marko(at)joh(dot)to> writes:
> On Wed, Sep 27, 2017 at 5:45 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Looking at it another way, the main thing that the combination of hashagg
>> outer path + indexscan inner path knows that eqjoinsel_semi didn't account
>> for is that there's a unique index on foo.id. But that info is available
>> to eqjoinsel_semi, in the sense that it's been given a nondefault estimate
>> that nd1 is equal to the outer relation size. So the mistake that it's
>> making is to throw up its hands and use an 0.5 selectivity estimate just
>> because it has no info about the inner relation. I think if we'd pushed
>> through the nd2/nd1 calculation after setting nd2 = size of inner rel,
>> we'd end up with an estimate matching the product of these path sizes.
>> (Caution: inadequate caffeine absorbed yet, this might be all wrong.)

> This sounds very reasonable to me.

I experimented a bit with the attached patch, which modifies
eqjoinsel_semi in two ways. First, if the number-of-distinct-values
estimate for the inner rel is just a default rather than having any
real basis, it replaces it with inner_rel->rows, effectively assuming
that the inside of the IN or EXISTS is unique. Second, it drops the
fallback to selectivity 0.5 altogether, just applying the nd1 vs nd2
heuristic all the time. This gets rid of the discontinuity of behavior
at 200 estimated distinct values. The behavior in your example is
maybe a bit funny-looking: for LIMITs above 200 you get something like

 Nested Loop  (cost=7.18..869.25 rows=300 width=4)
   ->  HashAggregate  (cost=6.75..8.75 rows=200 width=4)
         Group Key: i.i
         ->  Limit  (cost=0.00..3.00 rows=300 width=4)
               ->  Function Scan on generate_series i  (cost=0.00..10.00 rows=1000 width=4)
   ->  Index Only Scan using foo_pkey on foo  (cost=0.42..4.30 rows=1 width=4)
         Index Cond: (id = i.i)

The reported rowcount for the HashAggregate is still 200, which
is out of sync with what we did at the join level. I think that's
just a cosmetic thing though, and am disinclined to try to jury-rig
something to make it look different.
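For readers following along, here is a rough Python sketch of the two changes to the no-MCV path of eqjoinsel_semi as described above. It is an illustration of the heuristic only, not the actual C code: the function and parameter names are hypothetical, and null fractions and MCV handling are omitted.

```python
DEFAULT_NUM_DISTINCT = 200.0  # planner's default when ndistinct is unknown

def semi_join_selectivity(nd1, nd2, nd2_is_default, inner_rows):
    """Hypothetical sketch of the semi-join (IN/EXISTS) selectivity
    heuristic, with the patch's two changes applied."""
    if nd2_is_default:
        # Change #1: rather than trusting the default ndistinct, assume
        # the inner side of the IN/EXISTS is unique.
        nd2 = inner_rows
    # Change #2: always apply the nd2-vs-nd1 heuristic; the previous
    # behavior fell back to a flat 0.5 when ndistinct was just a default,
    # causing a discontinuity at 200 estimated distinct values.
    if nd1 <= nd2:
        return 1.0
    return nd2 / nd1

# e.g. outer side with nd1 = 1000 distinct values, inner rel of 300 rows
# whose ndistinct is only a default: estimate is 300/1000 = 0.3, and it
# varies smoothly as the inner row count changes.
```

Under the old logic, crossing the 200-distinct-values threshold in the subquery could flip the estimate between this ratio and the flat 0.5; with the sketch above, the estimate tracks the inner row count continuously.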

The change causes one plan change that's visible in the regression test
outputs; that change seems OK so I've just included an expected-output
change below.

This could use somebody trying harder to break it than I have. It might
also be wise to dig into the list archives and see what prompted the
inclusion of the don't-use-the-equations-with-default-ndistinct logic
in the first place.

regards, tom lane

Attachment Content-Type Size
semijoin-selectivity-with-default-ndistinct-1.patch text/x-diff 5.0 KB
