Skip site navigation (1) Skip section navigation (2)

Join problem

From: Silke Trissl <trissl(at)informatik(dot)hu-berlin(dot)de>
To: pgsql-performance(at)postgresql(dot)org
Subject: Join problem
Date: 2004-04-27 17:32:36
Message-ID: 408E9934.8060509@informatik.hu-berlin.de (view raw or flat)
Thread:
Lists: pgsql-performance
Hi,

I came across a very intriguing thing:

I had to join two tables and in both tables I wanted to restrict the 
result set by some (text/varchar) attributes.

Here is an example:

Table "item" # 147 000 entries

     Column     |         Type          | Modifiers
---------------+-----------------------+------------
  id            | integer               | not null
  description   | text                  |
  comment       | text                  | not null
  order_id      | integer               |


Table "orders" # 210 000 entries
       Column      |          Type          | Modifiers
-----------------+------------------------+-----------
  order_id        | integer                |
  order_name      | character varying(255) |


The tables have 147 000 and 210 000 entries, respectively.

First I tried the following query, which took ages:

(Query 1)
EXPLAIN ANALYZE
SELECT item.id
FROM item, orders
WHERE orders.order_name ~* 'Smit'
   AND item.description ~* 'CD'
   and orders.order_id = item.order_id;



I found out, that the change of the operator from '~*' to '=' for the 
item.description brought a great boost in performance (425 secs to 1 
sec!), but not in cost (Query plans at the end).

(Query 2)
  EXPLAIN ANALYZE
SELECT item.id
FROM item, orders
WHERE orders.order_name ~* 'Smit'
   AND item.description = 'CD'
   and orders.order_id = item.order_id;


The main difference was that Query 2 used the Hash join instead of the 
Nested Loop, so I disabled the option 'NESTED LOOP' and got for Query 1 
a similar time as for Query 2.


Can anyone tell me, why in one case the Hash join and in the other the 
much worse Nested Loop is prefered?
And my second question is, is there any possibility to execute the first 
query without disabling the Nested Loop first, but get the good 
performance of the Hash join?


Many thanks in advance for your help or suggestions

Silke


QUERY PLANS:

#####################################

Query 1:
   QUERY PLAN
------------------------------------------------------------------------------------------------------------------ 

  Nested Loop  (cost=0.00..28836.75 rows=1 width=4) (actual 
time=65350.780..452130.702 rows=6 loops=1)
    Join Filter: ("inner".order_id = "outer".order_id)
    ->  Seq Scan on item  (cost=0.00..28814.24 rows=1 width=8) (actual 
time=33.180..1365.190 rows=716 loops=1)
          Filter: (description ~* 'CD'::text)
    ->  Seq Scan on orders  (cost=0.00..22.50 rows=1 width=4) (actual 
time=21.644..629.500 rows=18 loops=716)
          Filter: ((order_name)::text ~* 'Smith'::text)
  Total runtime: 452130.782 ms
###########################################################################

Query 2:
   QUERY PLAN
------------------------------------------------------------------------------------------------------------------ 

   Hash Join  (cost=22.50..28840.44 rows=4 width=4) (actual 
time=1187.798..1187.798 rows=0 loops=1)
    Hash Cond: ("outer".order_id = "inner".order_id)
    ->  Seq Scan on item  (cost=0.00..28814.24 rows=733 width=8) (actual 
time=542.737..542.737 rows=0 loops=1)
          Filter: (description = 'CD'::text)
    ->  Hash  (cost=22.50..22.50 rows=1 width=4) (actual 
time=645.042..645.042 rows=0 loops=1)
          ->  Seq Scan on orders  (cost=0.00..22.50 rows=1 width=4) 
(actual time=22.373..644.996 rows=18 loops=1)
                Filter: ((order_name)::text ~* 'Smith'::text)
  Total runtime: 1187.865 ms
############################################################################


Query 1 with  'set enable_nestloop to false'

    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
  Hash Join  (cost=22.50..28836.75 rows=1 width=4) (actual 
time=1068.593..2003.330 rows=6 loops=1)
    Hash Cond: ("outer".item_id = "inner".item_id)
    ->  Seq Scan on item  (cost=0.00..28814.24 rows=1 width=8) (actual 
time=33.347..1357.073 rows=716 loops=1)
          Filter: (description ~* 'CD'::text)
    ->  Hash  (cost=22.50..22.50 rows=1 width=4) (actual 
time=645.287..645.287 rows=0 loops=1)
          ->  Seq Scan on orders  (cost=0.00..22.50 rows=1 width=4) 
(actual time=22.212..645.239 rows=18 loops=1)
                Filter: ((order_name)::text ~* 'CD'::text)
  Total runtime: 2003.409 ms


Responses

pgsql-performance by date

Next:From: scott.marloweDate: 2004-04-27 18:02:45
Subject: Re: OT: Help with performance problems
Previous:From: Dave CramerDate: 2004-04-27 15:18:34
Subject: Re: [JDBC] [PERFORM] is a good practice to create an index on the

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group