Re: New hashed IN code ignores distinctiveness of subquery

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bradley Baetz <bbaetz(at)acm(dot)org>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-27 02:43:18
Message-ID: 19010.1043635398@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Bradley Baetz <bbaetz(at)acm(dot)org> writes:
> On Sun, Jan 26, 2003 at 06:51:18PM -0500, Tom Lane wrote:
>> But the DISTINCT case is a different query. The question that's really
>> at hand is why the planner mistakenly prefers merge to hash join in
>> the non-DISTINCT case.

> Well, it's a different input, but the output is the same, so it may as
> well be the same query. IN (1,1,1) is the same as IN (1).

That's a good observation with respect to the problem of estimating the
number of output tuples; but it's irrelevant to the problem of
estimating execution cost. The plan that we're trying to cost here is
specifically a plan that runs a mergejoin *without* first uniq-ifying
the righthand relation. As such, the cost of the mergejoin is nearly
the same as the cost of an ordinary mergejoin of the two relations.
We save the costs of outputting join tuples that aren't wanted by the
IN, but we still have to run the mergejoin over those tuples.
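(For concreteness, the two query shapes under discussion look roughly
like this; the table and column names are made up for illustration:

    SELECT * FROM outer_t WHERE id IN (SELECT id FROM inner_t);
    -- plain IN: mergejoin the inner rel as-is; emit at most one result
    -- per outer row, but still merge past all the duplicate inner tuples

    SELECT * FROM outer_t WHERE id IN (SELECT DISTINCT id FROM inner_t);
    -- uniq-ify the subselect first, then join

In the first form the mergejoin still has to merge over all the
duplicate inner tuples; only the output of unwanted join tuples is
saved.)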

> Well, in this case the hash node could prune duplicates as they went
> into the table :). Not sure if that would be a win, though.

We're already checking that as a separate plan alternative. The
implementation could be improved a little, though --- we could combine
the uniq-ification into the Hash node.

> I don't follow that. We have 50000 outer rows, with a selectivity of say
> 0.1 (to make the maths nicer). That gives the roughly 5000 rows printed
> above. However, that's not the whole story. The result of the join will
> be 5000 rows _for each (unique) tuple in the subselect result_. For the
> 10 distinct values I have, that gives 50000 total rows, which is
> correct.

AFAICS there are two or three different concepts of selectivity being
tossed about here. If we want to continue defining "selectivity"
as the ratio of the number of output tuples to the cross product size,
then you need different selectivity numbers depending on whether you
take the cross product size to be num_outer * num_inner or num_outer *
num_distinct_inner or just num_outer. Only in the last case does the
valid selectivity range run from 0 to 1; if the selectivity
estimator returns something approaching 1 then the first two equations
will give very wrong answers. As an example, consider the case where
all the entries in both tables are the same value (say 42). The
present selectivity estimator *will* say 1.0, as it should for an
ordinary join. If you multiply by anything but num_outer you get the
wrong answer for the number of output tuples.
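(To put made-up numbers on that: say each table has 1000 rows and every
value is 42. An ordinary join produces 1000 * 1000 = 1,000,000 output
tuples, so selectivity = 1,000,000 / (1000 * 1000) = 1.0, which is what
the estimator reports. But the IN should produce only 1000 rows, one
per outer row:

    1.0 * num_outer             = 1.0 * 1000        = 1,000
    1.0 * num_outer * num_inner = 1.0 * 1000 * 1000 = 1,000,000

Multiplying by num_inner as well overestimates the IN's output by a
factor of a thousand.)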

It might be that we have to bite the bullet and set up a different
selectivity estimation procedure for IN. I'd prefer to avoid that
but haven't really figured out if it's necessary or not.

In the meantime, I think you are right that it's bogus that
set_joinrel_size_estimates uses different equations for JOIN_IN and
JOIN_UNIQUE_INNER. Whatever the decision is about how to do the
estimation, these should be done the same way.

regards, tom lane
