From: | Bradley Baetz <bbaetz(at)acm(dot)org> |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | New hashed IN code ignores distinctiveness of subquery |
Date: | 2003-01-26 12:26:12 |
Message-ID: | 20030126122612.GA3820@mango.home |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs pgsql-hackers |
I've been trying out the new hased subselect code from CVS. It appears
that the planner isn't taking the distinctiveness of the values from the
subselect into account:
bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3485675.30..3485675.30 rows=1 width=8) (actual
time=1430065.54..1430065.55 rows=1 loops=1)
-> Merge Join (cost=0.00..3485661.38 rows=5570 width=8) (actual
time=0.15..1429696.69 rows=50000 loops=1)
Merge Cond: ("outer".product_id = "inner".product_id)
-> Index Scan using bugs_product_id_idx on bugs
(cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43
rows=50000 loops=1)
-> Index Scan using bugs_product_id_idx on bugs
(cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44
rows=277884160 loops=1)
Total runtime: 1430102.08 msec
(6 rows)
bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT distinct product_id FROM bugs);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3033.30..3033.30 rows=1 width=8) (actual
time=505.17..505.17 rows=1 loops=1)
-> Hash Join (cost=1959.54..3032.67 rows=251 width=8) (actual
time=282.19..456.66 rows=50000 loops=1)
Hash Cond: ("outer".product_id = "inner".product_id)
-> Seq Scan on bugs (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.01..68.94 rows=50000 loops=1)
-> Hash (cost=1959.52..1959.52 rows=9 width=4) (actual
time=282.14..282.14 rows=0 loops=1)
-> Subquery Scan "IN_subquery" (cost=0.00..1959.52
rows=9 width=4) (actual time=0.13..282.08 rows=9 loops=1)
-> Unique (cost=0.00..1959.52 rows=9 width=4)
(actual time=0.13..282.03 rows=9 loops=1)
-> Index Scan using bugs_product_id_idx on
bugs (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..245.46
rows=50000 loops=1)
Total runtime: 505.30 msec
(9 rows)
bbaetz=# set enable_mergejoin=false;
SET
bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4281600.45..4281600.45 rows=1 width=8) (actual
time=486.80..486.80 rows=1 loops=1)
-> Hash Join (cost=1091.00..4281586.50 rows=5580 width=8) (actual
time=146.58..425.92 rows=50000 loops=1)
Hash Cond: ("outer".product_id = "inner".product_id)
-> Seq Scan on bugs (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.04..75.73 rows=50000 loops=1)
-> Hash (cost=795.00..795.00 rows=50000 width=4) (actual
time=146.34..146.34 rows=0 loops=1)
-> Seq Scan on bugs (cost=0.00..795.00 rows=50000
width=4) (actual time=0.06..64.98 rows=50000 loops=1)
Total runtime: 486.91 msec
(7 rows)
bugs is a table with 50000 rows, and products has 10 rows. (Only 9 of
the products are actually used in bugs.product_id, due to an off-by-one
error in the script I used to generate the table)
I still haven't tuned the various optimiser settings, which may explain
part of the enable_mergejoin=false result, although the DISTINCT
probably takes some time too (Side note - is it possible to notice that
DISTINCT on a column with a unique index doesn't need a Unique pass?).
However, 23 minutes vs 0.5 seconds isn't due to that. This is a fairly
useless and silly query though - I was just playing arround.
The tables have been analyzed, and there are separate unique indexes on
products.id, bugs.bug_id and bugs.product_id.
FWIW:
bbaetz=# select n_distinct, correlation FROM pg_stats WHERE
tablename='products' AND attname='product_id';
n_distinct | correlation
------------+-------------
9 | 0.0919474
(1 row)
Thanks,
Bradley
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2003-01-26 19:09:49 | Re: New hashed IN code ignores distinctiveness of subquery |
Previous Message | pgsql-bugs | 2003-01-26 10:04:09 | Bug #887: pg_restore blocks |
From | Date | Subject | |
---|---|---|---|
Next Message | Rod Taylor | 2003-01-26 13:21:52 | Re: domain check constraint syntax problem for 7.4 |
Previous Message | Hannu Krosing | 2003-01-26 09:35:24 | Re: Win32 port patches submitted |