Re: New hashed IN code ignores distinctiveness of subquery

From: Bradley Baetz <bbaetz(at)acm(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-27 04:50:41
Message-ID: 20030127045041.GA8600@mango.home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Sun, Jan 26, 2003 at 11:18:31PM -0500, Tom Lane wrote:
> Bradley Baetz <bbaetz(at)acm(dot)org> writes:
> > Right, or skip it entirely when selecting stuff with unique constraints.
>
> I'm hesitant to do that until we have some scheme in place for
> invalidating cached plans.

By cached, do you mean PREPARE stuff, or something else?

>
> > I don't think it is. The number of rows is correct if you do product_id
> > IN (1) vs product_id IN (1,2) vs product_id IN (1,2,3) and so on.
>
> But that's a completely different code path; it doesn't even enter the
> routines we're concerned about here.

Yes, but its the same concept. Although we seem to be agreeing about
that now :)

> > What is the point of JOIN_UNIQUE_{INNER,OUTER}, though? What does it do
> > that JOIN_IN doesn't?
>
> Uniqify the inner/outer path and then do a normal inner join. See
> joinpath.c.

Ah, OK. If I comment out line 547 of joinrels.c (which adds JOIN_IN to
the set of join paths) so that the UNIQUE joins are all that are left to
try, then I get:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
QUERY
PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3494816.98..3494816.98 rows=1 width=8) (actual
time=579.71..579.71 rows=1 loops=1)
-> Merge Join (cost=5169.41..3494691.43 rows=50218 width=8) (actual
time=111.41..530.16 rows=50000 loops=1)
Merge Cond: ("outer".product_id = "inner".product_id)
-> Index Scan using bugs_product_id_idx on bugs
(cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..249.57
rows=50000 loops=1)
-> Sort (cost=920.14..920.17 rows=9 width=4) (actual
time=111.25..143.42 rows=44476 loops=1)
Sort Key: public.bugs.product_id
-> HashAggregate (cost=920.00..920.00 rows=9 width=4)
(actual time=111.17..111.18 rows=9 loops=1)
-> Seq Scan on bugs (cost=0.00..795.00 rows=50000
width=4) (actual time=0.00..67.41 rows=50000 loops=1)
Total runtime: 579.84 msec
(9 rows)

(This isn't picked without my hack, because the cost is slightly higher
than the JOIN_IN version)

However, its much faster (although not as fast as sticking the DISTINCT
in there myself), but the actual rows coming from the sort is really odd
- where is that number coming from? How can sorting 9 rows take 44476
anythings? The final mergejoin cost is still way off, too.

> regards, tom lane

Thanks,

Bradley

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2003-01-27 05:00:08 Re: New hashed IN code ignores distinctiveness of subquery
Previous Message Tom Lane 2003-01-27 04:18:31 Re: New hashed IN code ignores distinctiveness of subquery

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2003-01-27 05:00:08 Re: New hashed IN code ignores distinctiveness of subquery
Previous Message Tom Lane 2003-01-27 04:18:31 Re: New hashed IN code ignores distinctiveness of subquery