Re: Cost Issue - How do I force a Hash Join

From: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Cost Issue - How do I force a Hash Join
Date: 2006-02-21 05:54:14
Message-ID: 43FAAB06.5030901@modgraph-usa.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

"Virag Saksena" <virag(at)auptyma(dot)com> writes:
>> The individual queries run in 50-300 ms. However the optimizer is
>> choosing a nested loop to join them rather than a Hash join...

I have what appears to be the identical problem.

This is a straightforward query that should be fairly quick, but takes about 30 minutes. It's a query across three tables, call them A, B, and C. The tables are joined on indexed columns.

Here's a quick summary:

Table A -----> Table B -----> Table C
A_ID B_ID C_ID
A_ID NAME
C_ID

Tables A and B have 6 million rows each. Table C is small: 67 names, no repeats. All columns involved in the join are indexed.

Summary:
1. Query B only: 2.7 seconds, 302175 rows returned
2. Join B and C: 4.3 seconds, exact same answer
3. Join A and B: 7.2 minutes, exact same answer
4. Join A, B, C: 32.7 minutes, exact same answer

Looking at these:

Query #1 is doing the real work: finding the rows of interest.

Queries #1 and #2 ought to be virtually identical, since Table C has
just one row with C_ID = 9, but the time almost doubles.

Query #3 should take a bit longer than Query #1 because it has to join
300K rows, but the indexes should make this take just a few seconds,
certainly well under a minute.

Query #4 should be identical to Query #3, again because there's only
one row in Table C. 32 minutes is pretty horrible for such a
straightforward query.

It looks to me like the problem is the use of nested loops when a hash join should be used, but I'm no expert at query planning.

This is psql 8.0.3. Table definitions are at the end. (Table and column names are altered to protect the guilty, otherwise these are straight from Postgres.) I ran "vacuum full analyze" after the last data were added. Hardware is a Dell, 2-CPU Xeon, 4 GB memory, database is on a single SATA 7200RPM disk.

Thanks,
Craig

----------------------------

QUERY #1:
---------

explain analyze select B.A_ID from B where B.B_ID = 9;

Index Scan using i_B_B_ID on B (cost=0.00..154401.36 rows=131236 width=4) (actual time=0.158..1387.251 rows=302175 loops=1)
Index Cond: (B_ID = 9)
Total runtime: 2344.053 ms

QUERY #2:
---------

explain analyze select B.A_ID from B join C on (B.C_ID = C.C_ID) where C.name = 'Joe';

Nested Loop (cost=0.00..258501.92 rows=177741 width=4) (actual time=0.349..3392.532 rows=302175 loops=1)
-> Seq Scan on C (cost=0.00..12.90 rows=1 width=4) (actual time=0.232..0.336 rows=1 loops=1)
Filter: ((name)::text = 'Joe'::text)
-> Index Scan using i_B_C_ID on B (cost=0.00..254387.31 rows=328137 width=8) (actual time=0.102..1290.002 rows=302175 loops=1)
Index Cond: (B.C_ID = "outer".C_ID)
Total runtime: 4373.916 ms

QUERY #3:
---------

explain analyze
select A.A_ID from A
join B on (A.A_ID = B.A_ID)
where B.B_ID = 9;

Nested Loop (cost=0.00..711336.41 rows=131236 width=4) (actual time=37.118..429419.347 rows=302175 loops=1)
-> Index Scan using i_B_B_ID on B (cost=0.00..154401.36 rows=131236 width=4) (actual time=27.344..8858.489 rows=302175 loops=1)
Index Cond: (B_ID = 9)
-> Index Scan using pk_A_test on A (cost=0.00..4.23 rows=1 width=4) (actual time=1.372..1.376 rows=1 loops=302175)
Index Cond: (A.A_ID = "outer".A_ID)
Total runtime: 430467.686 ms

QUERY #4:
---------
explain analyze
select A.A_ID from A
join B on (A.A_ID = B.A_ID)
join C on (B.B_ID = C.B_ID)
where C.name = 'Joe';

Nested Loop (cost=0.00..1012793.38 rows=177741 width=4) (actual time=70.184..1960112.247 rows=302175 loops=1)
-> Nested Loop (cost=0.00..258501.92 rows=177741 width=4) (actual time=52.114..17753.638 rows=302175 loops=1)
-> Seq Scan on C (cost=0.00..12.90 rows=1 width=4) (actual time=0.109..0.176 rows=1 loops=1)
Filter: ((name)::text = 'Joe'::text)
-> Index Scan using i_B_B_ID on B (cost=0.00..254387.31 rows=328137 width=8) (actual time=51.985..15566.896 rows=302175 loops=1)
Index Cond: (B.B_ID = "outer".B_ID)
-> Index Scan using pk_A_test on A (cost=0.00..4.23 rows=1 width=4) (actual time=6.407..6.412 rows=1 loops=302175)
Index Cond: (A.A_ID = "outer".A_ID)
Total runtime: 1961200.079 ms

TABLE DEFINITIONS:
------------------

xxx => \d a
Table "xxx.a"
Column | Type | Modifiers
-------------------+------------------------+-----------
a_id | integer | not null
... more columns

Indexes:
"pk_a_id" PRIMARY KEY, btree (a_id)
... more indexes on other columns

xxx => \d b
Table "xxx.b"
Column | Type | Modifiers
--------------------------+------------------------+-----------
b_id | integer | not null
a_id | integer | not null
c_id | integer | not null
... more columns

Indexes:
"b_pkey" PRIMARY KEY, btree (b_id)
"i_b_a_id" btree (a_id)
"i_b_c_id" btree (c_id)

xxx=> \d c
Table "xxx.c"
Column | Type | Modifiers
---------------+------------------------+-----------
c_id | integer | not null
name | character varying(200) |
... more columns
Indexes:
"c_pkey" PRIMARY KEY, btree (c_id)

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Virag Saksena 2006-02-21 06:33:39 Re: Cost Issue - How do I force a Hash Join
Previous Message Tom Lane 2006-02-21 05:35:55 Re: Cost Issue - How do I force a Hash Join