Join problem

From: Silke Trissl <trissl(at)informatik(dot)hu-berlin(dot)de>
To: pgsql-performance(at)postgresql(dot)org
Subject: Join problem
Date: 2004-04-27 17:32:36
Message-ID: 408E9934.8060509@informatik.hu-berlin.de
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-performance

Hi,

I came across a very intriguing thing:

I had to join two tables and in both tables I wanted to restrict the
result set by some (text/varchar) attributes.

Here is an example:

Table "item" # 147 000 entries

Column | Type | Modifiers
---------------+-----------------------+------------
id | integer | not null
description | text |
comment | text | not null
order_id | integer |

Table "orders" # 210 000 entries
Column | Type | Modifiers
-----------------+------------------------+-----------
order_id | integer |
order_name | character varying(255) |

The tables have 147 000 and 210 000 entries, respectively.

First I tried the following query, which took ages:

(Query 1)
EXPLAIN ANALYZE
SELECT item.id
FROM item, orders
WHERE orders.order_name ~* 'Smit'
AND item.description ~* 'CD'
and orders.order_id = item.order_id;

I found out, that the change of the operator from '~*' to '=' for the
item.description brought a great boost in performance (425 secs to 1
sec!), but not in cost (Query plans at the end).

(Query 2)
EXPLAIN ANALYZE
SELECT item.id
FROM item, orders
WHERE orders.order_name ~* 'Smit'
AND item.description = 'CD'
and orders.order_id = item.order_id;

The main difference was that Query 2 used the Hash join instead of the
Nested Loop, so I disabled the option 'NESTED LOOP' and got for Query 1
a similar time as for Query 2.

Can anyone tell me, why in one case the Hash join and in the other the
much worse Nested Loop is prefered?
And my second question is, is there any possibility to execute the first
query without disabling the Nested Loop first, but get the good
performance of the Hash join?

Many thanks in advance for your help or suggestions

Silke

QUERY PLANS:

#####################################

Query 1:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------

Nested Loop (cost=0.00..28836.75 rows=1 width=4) (actual
time=65350.780..452130.702 rows=6 loops=1)
Join Filter: ("inner".order_id = "outer".order_id)
-> Seq Scan on item (cost=0.00..28814.24 rows=1 width=8) (actual
time=33.180..1365.190 rows=716 loops=1)
Filter: (description ~* 'CD'::text)
-> Seq Scan on orders (cost=0.00..22.50 rows=1 width=4) (actual
time=21.644..629.500 rows=18 loops=716)
Filter: ((order_name)::text ~* 'Smith'::text)
Total runtime: 452130.782 ms
###########################################################################

Query 2:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------

Hash Join (cost=22.50..28840.44 rows=4 width=4) (actual
time=1187.798..1187.798 rows=0 loops=1)
Hash Cond: ("outer".order_id = "inner".order_id)
-> Seq Scan on item (cost=0.00..28814.24 rows=733 width=8) (actual
time=542.737..542.737 rows=0 loops=1)
Filter: (description = 'CD'::text)
-> Hash (cost=22.50..22.50 rows=1 width=4) (actual
time=645.042..645.042 rows=0 loops=1)
-> Seq Scan on orders (cost=0.00..22.50 rows=1 width=4)
(actual time=22.373..644.996 rows=18 loops=1)
Filter: ((order_name)::text ~* 'Smith'::text)
Total runtime: 1187.865 ms
############################################################################

Query 1 with 'set enable_nestloop to false'

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Hash Join (cost=22.50..28836.75 rows=1 width=4) (actual
time=1068.593..2003.330 rows=6 loops=1)
Hash Cond: ("outer".item_id = "inner".item_id)
-> Seq Scan on item (cost=0.00..28814.24 rows=1 width=8) (actual
time=33.347..1357.073 rows=716 loops=1)
Filter: (description ~* 'CD'::text)
-> Hash (cost=22.50..22.50 rows=1 width=4) (actual
time=645.287..645.287 rows=0 loops=1)
-> Seq Scan on orders (cost=0.00..22.50 rows=1 width=4)
(actual time=22.212..645.239 rows=18 loops=1)
Filter: ((order_name)::text ~* 'CD'::text)
Total runtime: 2003.409 ms

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message scott.marlowe 2004-04-27 18:02:45 Re: OT: Help with performance problems
Previous Message Dave Cramer 2004-04-27 15:18:34 Re: [JDBC] [PERFORM] is a good practice to create an index on the