Re: New hashed IN code ignores distinctiveness of subquery

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bradley Baetz <bbaetz(at)acm(dot)org>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-26 19:09:49
Message-ID: 29875.1043608189@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Bradley Baetz <bbaetz(at)acm(dot)org> writes:
> I've been trying out the new hased subselect code from CVS. It appears
> that the planner isn't taking the distinctiveness of the values from the
> subselect into account:

This isn't really anything to do with the new IN code, but is a
long-standing problem: cost_mergejoin doesn't apply any penalty factor
for the case where there are lots of duplicates in both inner and outer
relation (causing rescans of big chunks of the inner relation). You can
see the rescanning happening in the EXPLAIN ANALYZE output:

> -> Merge Join (cost=0.00..3485661.38 rows=5570 width=8) (actual
> time=0.15..1429696.69 rows=50000 loops=1)
> Merge Cond: ("outer".product_id = "inner".product_id)
> -> Index Scan using bugs_product_id_idx on bugs
> (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43
> rows=50000 loops=1)
> -> Index Scan using bugs_product_id_idx on bugs
> (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44
> rows=277884160 loops=1)
^^^^^^^^^^^^^^

277884160 rows pulled from a 50000-row relation means a heck of a lot of
rescanning went on :-(

The good news is that the system *is* aware of the small number of
distinct values in the table (note the dead-on estimate of the number of
distinct rows in your other query; which I think is from new-for-7.4
code, though the required stats have been available since 7.2).

I think it'd probably be possible to make some reasonable estimate of
the amount of rescanning required, and then inflate the mergejoin cost
estimate proportionally. I have not gotten around to looking at the
problem though. Care to take a shot at it?

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Neil Conway 2003-01-26 22:16:54 Re: Bug #883: explain analyze causes postgres to die
Previous Message Bradley Baetz 2003-01-26 12:26:12 New hashed IN code ignores distinctiveness of subquery

Browse pgsql-hackers by date

  From Date Subject
Next Message Dave Page 2003-01-26 20:18:41 Re: Have a PG 7.3.1 Windows (cygwin) easy installer... now what to do with it?
Previous Message Ross J. Reedstrom 2003-01-26 16:17:43 Re: Can we revisit the thought of PostgreSQL 7.2.4?