Re: Why hash join instead of nested loop?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rhett Garber <rhettg(at)gmail(dot)com>
Cc: Havasvölgyi Ottó <h(dot)otto(at)freemail(dot)hu>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Why hash join instead of nested loop?
Date: 2005-08-09 00:58:26
Message-ID: 10930.1123549106@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Rhett Garber <rhettg(at)gmail(dot)com> writes:
> This is postgres 7.4.1
> All the rows involved are integers.

Hmph. There is something really strange going on here. I tried to
duplicate your problem in 7.4.*, thus:

regression=# create table rtmessagestate(id int, f1 char(6));
CREATE TABLE
regression=# insert into rtmessagestate values(1,'z');
INSERT 559399 1
regression=# insert into rtmessagestate values(2,'z');
INSERT 559400 1
regression=# insert into rtmessagestate values(3,'z');
INSERT 559401 1
regression=# insert into rtmessagestate values(4,'z');
INSERT 559402 1
regression=# insert into rtmessagestate values(5,'z');
INSERT 559403 1
regression=# vacuum analyze rtmessagestate;
VACUUM
regression=# create table connection(connection_registry_id int, obj1 int, obj2 int);
CREATE TABLE
regression=# create index connection_regid_obj1_index on connection(connection_registry_id,obj1);
CREATE INDEX
regression=# insert into connection values(40105,73582,3);
INSERT 559407 1
regression=# explain analyze select rtmessagestate.* from rtmessagestate,connection where (connection_registry_id = 40105) AND (obj1 = 73582) and id = obj2;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=4.83..5.91 rows=1 width=14) (actual time=0.498..0.544 rows=1 loops=1)
Hash Cond: ("outer".id = "inner".obj2)
-> Seq Scan on rtmessagestate (cost=0.00..1.05 rows=5 width=14) (actual time=0.030..0.072 rows=5 loops=1)
-> Hash (cost=4.83..4.83 rows=1 width=4) (actual time=0.305..0.305 rows=0 loops=1)
-> Index Scan using connection_regid_obj1_index on connection (cost=0.00..4.83 rows=1 width=4) (actual time=0.236..0.264 rows=1 loops=1)
Index Cond: ((connection_registry_id = 40105) AND (obj1 = 73582))
Total runtime: 1.119 ms
(7 rows)

This duplicates your example as to plan and row counts:

> Hash Join (cost=5.96..7.04 rows=1 width=14) (actual
> time=10.591..10.609 rows=1 loops=1)
> Hash Cond: ("outer".id = "inner".obj2)
> -> Seq Scan on rtmessagestate (cost=0.00..1.05 rows=5 width=14)
> (actual time=0.011..0.022 rows=5 loops=1)
> -> Hash (cost=5.96..5.96 rows=1 width=4) (actual
> time=0.109..0.109 rows=0 loops=1)
> -> Index Scan using connection_regid_obj1_index on
> connection (cost=0.00..5.96 rows=1 width=4) (actual time=0.070..0.076
> rows=1 loops=1)
> Index Cond: ((connection_registry_id = 40105) AND (obj1
> = 73582)) Total runtime: 11.536 ms
> (7 rows)

My machine is considerably slower than yours, to judge by the actual
elapsed times in the scan nodes ... so why is it beating the pants
off yours in the join step?

Can you try the above script verbatim in a scratch database and see
what you get? (Note it's worth trying the explain two or three
times to be sure the values have settled out.)

I'm testing a fairly recent 7.4-branch build (7.4.8 plus), so that's one
possible reason for the discrepancy between my results and yours, but I
do not see anything in the 7.4 CVS logs that looks like it's related to
hashjoin performance.

I'd be interested to see results from other people using 7.4.* too.

regards, tom lane

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Steinar H. Gunderson 2005-08-09 01:08:50 Re: Why hash join instead of nested loop?
Previous Message Tom Lane 2005-08-08 19:27:25 Re: Finding bottleneck