query plan question, nested loop vs hash join

From: Andrey Lizenko <lizenko79(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: query plan question, nested loop vs hash join
Date: 2014-10-03 15:38:15
Message-ID: CADKuZZB=TeBszkxmzPGeZ14cBm4wpCqK9TsfXyUoooamPzekOw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,
I have similar problem as in
http://www.postgresql.org/message-id/flat/52B311C4(dot)1070108(at)gmail(dot)com#52B311C4(dot)1070108@gmail.com

server version is 9.3.4

Here is only two quite simple tables:

db_new=# \d activities_example
Table "public.activities_example"
Column | Type | Modifiers
----------------+---------+-----------
id | integer |
order_chain_id | integer |
Indexes:
"activities_example_idx" btree (order_chain_id)

db_new=# \d orders_example
Table "public.orders_example"
Column | Type | Modifiers
--------+---------+-----------
id | integer |

Number of rows as below:

db_new=# select count(*) from activities_example ;
count
---------
3059965

db_new=# select count(*) from orders_example ;
count
-------
19038

db_new=# select count(*) from activities_example where order_chain_id in
(select id from orders_example);
count
-------
91426
(1 row)

and I can see that planner uses hashjoin with all enabled options and
nested loop with disabled parameter:

db_new=# explain analyze select * from activities_example where
order_chain_id in (select id from orders_example);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=513.36..57547.59 rows=89551 width=8) (actual
time=18.340..966.367 rows=91426 loops=1)
Hash Cond: (activities_example.order_chain_id = orders_example.id)
-> Seq Scan on activities_example (cost=0.00..44139.65 rows=3059965
width=8) (actual time=0.018..294.216 rows=3059965 loops=1)
-> Hash (cost=275.38..275.38 rows=19038 width=4) (actual
time=5.458..5.458 rows=19038 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 670kB
-> Seq Scan on orders_example (cost=0.00..275.38 rows=19038
width=4) (actual time=0.015..2.308 rows=19038 loops=1)
Total runtime: 970.234 ms
(7 rows)

db_new=# set enable_hashjoin = off;
SET
db_new=# explain analyze select * from activities_example where
order_chain_id in (select id from orders_example);
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=1629.09..166451.01 rows=89551 width=8) (actual
time=16.091..116.476 rows=91426 loops=1)
-> Unique (cost=1628.66..1723.85 rows=19038 width=4) (actual
time=15.929..23.156 rows=19038 loops=1)
-> Sort (cost=1628.66..1676.25 rows=19038 width=4) (actual
time=15.892..19.884 rows=19038 loops=1)
Sort Key: orders_example.id
Sort Method: external sort Disk: 264kB
-> Seq Scan on orders_example (cost=0.00..275.38
rows=19038 width=4) (actual time=0.015..2.747 rows=19038 loops=1)
-> Index Scan using activities_example_idx on activities_example
(cost=0.43..8.60 rows=5 width=8) (actual time=0.002..0.004 rows=5
loops=19038)
Index Cond: (order_chain_id = orders_example.id)
Total runtime: 121.366 ms
(9 rows)

second runtime is much more quicker.

What is the reason of "Seq Scan on activities_example" in the first case?
Is it possible to force optimizer choose the second plan without doing
"set enable_hashjoin = off;" ?

Increasing of 'effective_cache_size' leads to similar thing with
mergejoin,
other options (work_mem, shared_buffers. etc) do not change anything.

Thanks in advance.

--
Regards, Andrey Lizenko

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Claudio Freire 2014-10-03 16:33:18 Re: Planning for Scalability
Previous Message Roberto Grandi 2014-10-03 13:43:49 Re: Planning for Scalability