Skip site navigation (1) Skip section navigation (2)

Re: Hash join on int takes 8..114 seconds

From: PFC <lists(at)peufeu(dot)com>
To: Andrus <kobruleht2(at)hot(dot)ee>, "Richard Huxton" <dev(at)archonet(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash join on int takes 8..114 seconds
Date: 2008-11-20 13:46:11
Message-ID: op.ukw669uvcigqcu@soyouz (view raw or flat)
Thread:
Lists: pgsql-performance
	OK so vmstat says you are IO-bound, this seems logical if the same plan  
has widely varying timings...

	Let's look at the usual suspects :

	- how many dead rows in your tables ? are your tables data, or bloat ?  
(check vacuum verbose, etc)
	- what's the size of the dataset relative to the RAM ?

	Now let's look more closely at the query :

explain analyze
SELECT sum(1)
   FROM dok JOIN rid USING (dokumnr)
   JOIN toode USING (toode)
   LEFT JOIN artliik using(grupp,liik)
   WHERE rid.toode='X05' AND dok.kuupaev>='2008-09-01'


	OK, so artliik is a very small table (84 rows) :

Seq Scan on artliik  (cost=0.00..6.84 rows=84 width=19)
(actual time=20.104..29.845 rows=84 loops=1)

	I presume doing the query without artliik changes nothing to the runtime,  
yes ?
	Let's look at the main part of the query :

   FROM dok JOIN rid USING (dokumnr) JOIN toode USING (toode)
   WHERE rid.toode='X05' AND dok.kuupaev>='2008-09-01'

	Postgres's plan is logical. It starts by joining rid and dok since your  
WHERE is on those :

->  Hash Join  (cost=52103.94..233488.08 rows=24126 width=24) (actual  
time=100386.921..114037.986 rows=20588 loops=1)"
	Hash Cond: ("outer".dokumnr = "inner".dokumnr)"
	->  Bitmap Heap Scan on rid  (cost=4127.51..175020.84 rows=317003  
width=28) (actual time=11119.932..76225.918 rows=277294 loops=1)"
		  Recheck Cond: (toode = 'X05'::bpchar)"
		  ->  Bitmap Index Scan on rid_toode_idx (cost=0.00..4127.51 rows=317003  
width=0) (actual time=11105.807..11105.807 rows=280599 loops=1)"
				Index Cond: (toode = 'X05'::bpchar)"
	->  Hash  (cost=47376.82..47376.82 rows=93444 width=4) (actual  
time=35082.427..35082.427 rows=105202 loops=1)"
		  ->  Index Scan using dok_kuupaev_idx on dok (cost=0.00..47376.82  
rows=93444 width=4) (actual time=42.110..34586.331 rows=105202 loops=1)"
				Index Cond: (kuupaev >= '2008-09-01'::date)"

	Your problem here is that, no matter what, postgres will have to examine
	- all rows where dok.kuupaev>='2008-09-01',
	- and all rows where rid.toode = 'X05'.
	If you use dok.kuupaev>='2007-09-01' (note : 2007) it will probably have  
to scan many, many more rows.

	If you perform this query often you could CLUSTER rid on (toode) and dok  
on (kuupaev), but this can screw other queries.

	What is the meaning of the columns ?

	To make this type of query faster I would tend to think about :

	- materialized views
	- denormalization (ie adding a column in one of your tables and a  
multicolumn index)
	- materialized summary tables (ie. summary of sales for last month, for  
instance)


"Aggregate  (cost=234278.53..234278.54 rows=1 width=0) (actual  
time=114479.933..114479.936 rows=1 loops=1)"
"  ->  Hash Left Join  (cost=52111.20..234218.21 rows=24126 width=0)  
(actual time=100435.523..114403.293 rows=20588 loops=1)"
"        Hash Cond: (("outer".grupp = "inner".grupp) AND ("outer".liik =  
"inner".liik))"
"        ->  Nested Loop  (cost=52103.94..233735.35 rows=24126 width=19)  
(actual time=100405.258..114207.387 rows=20588 loops=1)"
"              ->  Index Scan using toode_pkey on toode  (cost=0.00..6.01  
rows=1 width=43) (actual time=18.312..18.325 rows=1 loops=1)"
"                    Index Cond: ('X05'::bpchar = toode)"
"              ->  Hash Join  (cost=52103.94..233488.08 rows=24126  
width=24) (actual time=100386.921..114037.986 rows=20588 loops=1)"
"                    Hash Cond: ("outer".dokumnr = "inner".dokumnr)"
"                    ->  Bitmap Heap Scan on rid  (cost=4127.51..175020.84  
rows=317003 width=28) (actual time=11119.932..76225.918 rows=277294  
loops=1)"
"                          Recheck Cond: (toode = 'X05'::bpchar)"
"                          ->  Bitmap Index Scan on rid_toode_idx  
(cost=0.00..4127.51 rows=317003 width=0) (actual time=11105.807..11105.807  
rows=280599 loops=1)"
"                                Index Cond: (toode = 'X05'::bpchar)"
"                    ->  Hash  (cost=47376.82..47376.82 rows=93444  
width=4) (actual time=35082.427..35082.427 rows=105202 loops=1)"
"                          ->  Index Scan using dok_kuupaev_idx on dok  
(cost=0.00..47376.82 rows=93444 width=4) (actual time=42.110..34586.331  
rows=105202 loops=1)"
"                                Index Cond: (kuupaev >=  
'2008-09-01'::date)"
"        ->  Hash  (cost=6.84..6.84 rows=84 width=19) (actual
time=30.220..30.220 rows=84 loops=1)"
"              ->  Seq Scan on artliik  (cost=0.00..6.84 rows=84 width=19)
(actual time=20.104..29.845 rows=84 loops=1)"
"Total runtime: 114480.373 ms"

In response to

Responses

pgsql-performance by date

Next:From: Tomas VondraDate: 2008-11-21 01:09:36
Subject: Re: Performance and IN clauses
Previous:From: Richard HuxtonDate: 2008-11-20 11:22:31
Subject: Re: Hash join on int takes 8..114 seconds

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group