Re: Hash join on int takes 8..114 seconds

From: PFC <lists(at)peufeu(dot)com>
To: Andrus <kobruleht2(at)hot(dot)ee>, "Richard Huxton" <dev(at)archonet(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash join on int takes 8..114 seconds
Date: 2008-11-20 13:46:11
Message-ID: op.ukw669uvcigqcu@soyouz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


OK so vmstat says you are IO-bound, this seems logical if the same plan
has widely varying timings...

Let's look at the usual suspects :

- how many dead rows in your tables ? are your tables data, or bloat ?
(check vacuum verbose, etc)
- what's the size of the dataset relative to the RAM ?

Now let's look more closely at the query :

explain analyze
SELECT sum(1)
FROM dok JOIN rid USING (dokumnr)
JOIN toode USING (toode)
LEFT JOIN artliik using(grupp,liik)
WHERE rid.toode='X05' AND dok.kuupaev>='2008-09-01'

OK, so artliik is a very small table (84 rows) :

Seq Scan on artliik (cost=0.00..6.84 rows=84 width=19)
(actual time=20.104..29.845 rows=84 loops=1)

I presume doing the query without artliik changes nothing to the runtime,
yes ?
Let's look at the main part of the query :

FROM dok JOIN rid USING (dokumnr) JOIN toode USING (toode)
WHERE rid.toode='X05' AND dok.kuupaev>='2008-09-01'

Postgres's plan is logical. It starts by joining rid and dok since your
WHERE is on those :

-> Hash Join (cost=52103.94..233488.08 rows=24126 width=24) (actual
time=100386.921..114037.986 rows=20588 loops=1)"
Hash Cond: ("outer".dokumnr = "inner".dokumnr)"
-> Bitmap Heap Scan on rid (cost=4127.51..175020.84 rows=317003
width=28) (actual time=11119.932..76225.918 rows=277294 loops=1)"
Recheck Cond: (toode = 'X05'::bpchar)"
-> Bitmap Index Scan on rid_toode_idx (cost=0.00..4127.51 rows=317003
width=0) (actual time=11105.807..11105.807 rows=280599 loops=1)"
Index Cond: (toode = 'X05'::bpchar)"
-> Hash (cost=47376.82..47376.82 rows=93444 width=4) (actual
time=35082.427..35082.427 rows=105202 loops=1)"
-> Index Scan using dok_kuupaev_idx on dok (cost=0.00..47376.82
rows=93444 width=4) (actual time=42.110..34586.331 rows=105202 loops=1)"
Index Cond: (kuupaev >= '2008-09-01'::date)"

Your problem here is that, no matter what, postgres will have to examine
- all rows where dok.kuupaev>='2008-09-01',
- and all rows where rid.toode = 'X05'.
If you use dok.kuupaev>='2007-09-01' (note : 2007) it will probably have
to scan many, many more rows.

If you perform this query often you could CLUSTER rid on (toode) and dok
on (kuupaev), but this can screw other queries.

What is the meaning of the columns ?

To make this type of query faster I would tend to think about :

- materialized views
- denormalization (ie adding a column in one of your tables and a
multicolumn index)
- materialized summary tables (ie. summary of sales for last month, for
instance)

"Aggregate (cost=234278.53..234278.54 rows=1 width=0) (actual
time=114479.933..114479.936 rows=1 loops=1)"
" -> Hash Left Join (cost=52111.20..234218.21 rows=24126 width=0)
(actual time=100435.523..114403.293 rows=20588 loops=1)"
" Hash Cond: (("outer".grupp = "inner".grupp) AND ("outer".liik =
"inner".liik))"
" -> Nested Loop (cost=52103.94..233735.35 rows=24126 width=19)
(actual time=100405.258..114207.387 rows=20588 loops=1)"
" -> Index Scan using toode_pkey on toode (cost=0.00..6.01
rows=1 width=43) (actual time=18.312..18.325 rows=1 loops=1)"
" Index Cond: ('X05'::bpchar = toode)"
" -> Hash Join (cost=52103.94..233488.08 rows=24126
width=24) (actual time=100386.921..114037.986 rows=20588 loops=1)"
" Hash Cond: ("outer".dokumnr = "inner".dokumnr)"
" -> Bitmap Heap Scan on rid (cost=4127.51..175020.84
rows=317003 width=28) (actual time=11119.932..76225.918 rows=277294
loops=1)"
" Recheck Cond: (toode = 'X05'::bpchar)"
" -> Bitmap Index Scan on rid_toode_idx
(cost=0.00..4127.51 rows=317003 width=0) (actual time=11105.807..11105.807
rows=280599 loops=1)"
" Index Cond: (toode = 'X05'::bpchar)"
" -> Hash (cost=47376.82..47376.82 rows=93444
width=4) (actual time=35082.427..35082.427 rows=105202 loops=1)"
" -> Index Scan using dok_kuupaev_idx on dok
(cost=0.00..47376.82 rows=93444 width=4) (actual time=42.110..34586.331
rows=105202 loops=1)"
" Index Cond: (kuupaev >=
'2008-09-01'::date)"
" -> Hash (cost=6.84..6.84 rows=84 width=19) (actual
time=30.220..30.220 rows=84 loops=1)"
" -> Seq Scan on artliik (cost=0.00..6.84 rows=84 width=19)
(actual time=20.104..29.845 rows=84 loops=1)"
"Total runtime: 114480.373 ms"

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tomas Vondra 2008-11-21 01:09:36 Re: Performance and IN clauses
Previous Message Richard Huxton 2008-11-20 11:22:31 Re: Hash join on int takes 8..114 seconds