Quick Links

New hashed IN code ignores distinctiveness of subquery

From:	Bradley Baetz <bbaetz(at)acm(dot)org>
To:	pgsql-bugs(at)postgresql(dot)org
Subject:	New hashed IN code ignores distinctiveness of subquery
Date:	2003-01-26 12:26:12
Message-ID:	20030126122612.GA3820@mango.home
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs pgsql-hackers

I've been trying out the new hased subselect code from CVS. It appears
that the planner isn't taking the distinctiveness of the values from the
subselect into account:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3485675.30..3485675.30 rows=1 width=8) (actual
time=1430065.54..1430065.55 rows=1 loops=1)
-> Merge Join (cost=0.00..3485661.38 rows=5570 width=8) (actual
time=0.15..1429696.69 rows=50000 loops=1)
Merge Cond: ("outer".product_id = "inner".product_id)
-> Index Scan using bugs_product_id_idx on bugs
(cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43
rows=50000 loops=1)
-> Index Scan using bugs_product_id_idx on bugs
(cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44
rows=277884160 loops=1)
Total runtime: 1430102.08 msec
(6 rows)

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT distinct product_id FROM bugs);

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3033.30..3033.30 rows=1 width=8) (actual
time=505.17..505.17 rows=1 loops=1)
-> Hash Join (cost=1959.54..3032.67 rows=251 width=8) (actual
time=282.19..456.66 rows=50000 loops=1)
Hash Cond: ("outer".product_id = "inner".product_id)
-> Seq Scan on bugs (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.01..68.94 rows=50000 loops=1)
-> Hash (cost=1959.52..1959.52 rows=9 width=4) (actual
time=282.14..282.14 rows=0 loops=1)
-> Subquery Scan "IN_subquery" (cost=0.00..1959.52
rows=9 width=4) (actual time=0.13..282.08 rows=9 loops=1)
-> Unique (cost=0.00..1959.52 rows=9 width=4)
(actual time=0.13..282.03 rows=9 loops=1)
-> Index Scan using bugs_product_id_idx on
bugs (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..245.46
rows=50000 loops=1)
Total runtime: 505.30 msec
(9 rows)

bbaetz=# set enable_mergejoin=false;
SET

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4281600.45..4281600.45 rows=1 width=8) (actual
time=486.80..486.80 rows=1 loops=1)
-> Hash Join (cost=1091.00..4281586.50 rows=5580 width=8) (actual
time=146.58..425.92 rows=50000 loops=1)
Hash Cond: ("outer".product_id = "inner".product_id)
-> Seq Scan on bugs (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.04..75.73 rows=50000 loops=1)
-> Hash (cost=795.00..795.00 rows=50000 width=4) (actual
time=146.34..146.34 rows=0 loops=1)
-> Seq Scan on bugs (cost=0.00..795.00 rows=50000
width=4) (actual time=0.06..64.98 rows=50000 loops=1)
Total runtime: 486.91 msec
(7 rows)

bugs is a table with 50000 rows, and products has 10 rows. (Only 9 of
the products are actually used in bugs.product_id, due to an off-by-one
error in the script I used to generate the table)

I still haven't tuned the various optimiser settings, which may explain
part of the enable_mergejoin=false result, although the DISTINCT
probably takes some time too (Side note - is it possible to notice that
DISTINCT on a column with a unique index doesn't need a Unique pass?).
However, 23 minutes vs 0.5 seconds isn't due to that. This is a fairly
useless and silly query though - I was just playing arround.

The tables have been analyzed, and there are separate unique indexes on
products.id, bugs.bug_id and bugs.product_id.

FWIW:

bbaetz=# select n_distinct, correlation FROM pg_stats WHERE
tablename='products' AND attname='product_id';
n_distinct | correlation
------------+-------------
9 | 0.0919474
(1 row)

Thanks,

Bradley

Responses

Re: New hashed IN code ignores distinctiveness of subquery at 2003-01-26 19:09:49 from Tom Lane

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Tom Lane	2003-01-26 19:09:49	Re: New hashed IN code ignores distinctiveness of subquery
Previous Message	pgsql-bugs	2003-01-26 10:04:09	Bug #887: pg_restore blocks

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Rod Taylor	2003-01-26 13:21:52	Re: domain check constraint syntax problem for 7.4
Previous Message	Hannu Krosing	2003-01-26 09:35:24	Re: Win32 port patches submitted