join vs. subquery

From: David Haas <dave(at)modelpredictivesystems(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: join vs. subquery
Date: 2005-02-22 06:47:59
Message-ID: a852bfe3b5d0e6099606d553291b674f@modelpredictivesystems.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi -

This is based on a discussion I was having with neilc on IRC. He
suggested I post it here. Sorry for the length - I'm including
everything he requested.

I'm comparing the speeds of the following two queries. I was curious
why query 1 was faster than query 2:

query 1:
Select layer_number
FROM batch_report_index
WHERE device_id = (SELECT device_id FROM device_index WHERE device_name
='CP8M')
AND technology_id = (SELECT technology_id FROM technology_index WHERE
technology_name = 'G12');

query 2:
Select b.layer_number
FROM batch_report_index b, device_index d, technology_index t
WHERE b.device_id = d.device_id
AND b.technology_id = t.technology_id
AND d.device_name = 'CP8M'
AND t.technology_name = 'G12';

Here were my first runs:
(query 1 explain analyze)
Seq Scan on batch_report_index (cost=6.05..12370.66 rows=83 width=4)
(actual time=19.274..1903.110 rows=61416 loops=1)
Filter: ((device_id = $0) AND (technology_id = $1))
InitPlan
-> Index Scan using device_index_device_name_key on device_index
(cost=0.00..4.88 rows=1 width=4) (actual time=0.310..0.320 rows=1
loops=1)
Index Cond: (device_name = 'CP8M'::text)
-> Seq Scan on technology_index (cost=0.00..1.18 rows=1 width=4)
(actual time=0.117..0.149 rows=1 loops=1)
Filter: (technology_name = 'G12'::text)
Total runtime: 1947.896 ms
(8 rows)

(query 2 explain analyze)
Hash Join (cost=6.06..12380.70 rows=46 width=4) (actual
time=35.509..2831.685 rows=61416 loops=1)
Hash Cond: ("outer".technology_id = "inner".technology_id)
-> Hash Join (cost=4.88..12375.87 rows=638 width=8) (actual
time=34.584..2448.862 rows=61416 loops=1)
Hash Cond: ("outer".device_id = "inner".device_id)
-> Seq Scan on batch_report_index b (cost=0.00..10182.74
rows=436374 width=12) (actual time=0.100..1373.085 rows=436374 loops=1)
-> Hash (cost=4.88..4.88 rows=1 width=4) (actual
time=0.635..0.635 rows=0 loops=1)
-> Index Scan using device_index_device_name_key on
device_index d (cost=0.00..4.88 rows=1 width=4) (actual
time=0.505..0.520 rows=1 loops=1)
Index Cond: (device_name = 'CP8M'::text)
-> Hash (cost=1.18..1.18 rows=1 width=4) (actual time=0.348..0.348
rows=0 loops=1)
-> Seq Scan on technology_index t (cost=0.00..1.18 rows=1
width=4) (actual time=0.198..0.239 rows=1 loops=1)
Filter: (technology_name = 'G12'::text)
Total runtime: 2872.252 ms
(12 rows)

On neilc's suggestion, I did a vacuum analyze, then turned off hash
joins. Here's query 2, no hash joins:

(query 2 explain analyze)
Nested Loop (cost=0.00..15651.44 rows=46 width=4) (actual
time=22.079..2741.103 rows=61416 loops=1)
Join Filter: ("inner".technology_id = "outer".technology_id)
-> Seq Scan on technology_index t (cost=0.00..1.18 rows=1 width=4)
(actual time=0.178..0.218 rows=1 loops=1)
Filter: (technology_name = 'G12'::text)
-> Nested Loop (cost=0.00..15642.29 rows=638 width=8) (actual
time=21.792..2530.470 rows=61416 loops=1)
Join Filter: ("inner".device_id = "outer".device_id)
-> Index Scan using device_index_device_name_key on
device_index d (cost=0.00..4.88 rows=1 width=4) (actual
time=0.331..0.346 rows=1 loops=1)
Index Cond: (device_name = 'CP8M'::text)
-> Seq Scan on batch_report_index b (cost=0.00..10182.74
rows=436374 width=12) (actual time=0.070..1437.938 rows=436374 loops=1)
Total runtime: 2782.628 ms
(10 rows)

He then suggested I turn hash_joins back on and put an index on the
batch_report_table's device_id. Here's query 2 again:

(query 2 explain analyze)
Hash Join (cost=1.18..2389.06 rows=46 width=4) (actual
time=1.562..2473.554 rows=61416 loops=1)
Hash Cond: ("outer".technology_id = "inner".technology_id)
-> Nested Loop (cost=0.00..2384.24 rows=638 width=8) (actual
time=0.747..2140.160 rows=61416 loops=1)
-> Index Scan using device_index_device_name_key on
device_index d (cost=0.00..4.88 rows=1 width=4) (actual
time=0.423..0.435 rows=1 loops=1)
Index Cond: (device_name = 'CP8M'::text)
-> Index Scan using b_r_device_index on batch_report_index b
(cost=0.00..2365.82 rows=1083 width=12) (actual time=0.288..1868.118
rows=61416 loops=1)
Index Cond: (b.device_id = "outer".device_id)
-> Hash (cost=1.18..1.18 rows=1 width=4) (actual time=0.359..0.359
rows=0 loops=1)
-> Seq Scan on technology_index t (cost=0.00..1.18 rows=1
width=4) (actual time=0.198..0.237 rows=1 loops=1)
Filter: (technology_name = 'G12'::text)
Total runtime: 2515.950 ms
(11 rows)

He then suggested I increase the statistics on batch_report_index & run
the query again. I "set statistics" for both
the device_id and technology_id column to 900, vacuum analyzed, and
re-ran the query (it's still slower than query
1 run after the same contortions ):

(query 2 explain analyze)
Hash Join (cost=1.18..1608.49 rows=46 width=4) (actual
time=1.437..1499.414 rows=61416 loops=1)
Hash Cond: ("outer".technology_id = "inner".technology_id)
-> Nested Loop (cost=0.00..1603.66 rows=638 width=8) (actual
time=0.613..1185.826 rows=61416 loops=1)
-> Index Scan using device_index_device_name_key on
device_index d (cost=0.00..4.88 rows=1 width=4) (actual
time=0.246..0.259 rows=1 loops=1)
Index Cond: (device_name = 'CP8M'::text)
-> Index Scan using b_r_device_index on batch_report_index b
(cost=0.00..1589.93 rows=708 width=12) (actual time=0.324..928.888
rows=61416 loops=1)
Index Cond: (b.device_id = "outer".device_id)
-> Hash (cost=1.18..1.18 rows=1 width=4) (actual time=0.358..0.358
rows=0 loops=1)
-> Seq Scan on technology_index t (cost=0.00..1.18 rows=1
width=4) (actual time=0.196..0.238 rows=1 loops=1)
Filter: (technology_name = 'G12'::text)
Total runtime: 1538.302 ms

At this point, he said "send it to the -perform mailing list". So here
I am.

The relevant table schemas as of right now:
reg=# \d batch_report_index
Table "public.batch_report_index"
Column | Type |
Modifiers
-----------------+------------------------
+---------------------------------------------------------------------
batch_report_id | integer | not null default
nextval('"batch_report__batch_report__seq"'::text)
lot | character varying(16) |
tool_id | integer | not null
technology_id | integer | not null
device_id | integer | not null
reticle_id | integer | not null
layer_id | integer | not null
layer_number | integer | not null
image_id | integer |
start_date | date |
start_time | time without time zone |
stop_date | date |
stop_time | time without time zone |
in_system | boolean | default false
Indexes:
"batch_report_index_pkey" primary key, btree (batch_report_id)
"b_r_device_index" btree (device_id)
"batch_report_stop_date_indexing" btree (stop_date)
"batch_report_tool_id_index" btree (tool_id)

reg=# \d technology_index
Table "public.technology_index"
Column | Type | Modifiers
-----------------+---------
+---------------------------------------------------------------------
technology_id | integer | not null default
nextval('"technology_in_technology_id_seq"'::text)
technology_name | text | not null
Indexes:
"technology_index_pkey" primary key, btree (technology_id)
"technology_in_technology_na_key" unique, btree (technology_name)

reg=# \d device_index
Table "public.device_index"
Column | Type | Modifiers
-------------+---------
+----------------------------------------------------------------
device_id | integer | not null default
nextval('"device_index_device_id_seq"'::text)
device_name | text | not null
Indexes:
"device_index_pkey" primary key, btree (device_id)
"device_index_device_name_key" unique, btree (device_name)

Thanks for the great work y'all do!

Browse pgsql-performance by date

  From Date Subject
Next Message David Haas 2005-02-22 07:05:08 subquery vs join on 7.4.5
Previous Message Mark Kirkwood 2005-02-22 05:45:23 Re: Problem with 7.4.5 and webmin 1.8 in grant function