
Re: Speeding up aggregates

From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Speeding up aggregates
Date: 2002-12-08 19:31:54
Message-ID: 3DF39E2A.4010802@joeconway.com
Lists: pgsql-performance
Tom Lane wrote:
> FWIW, I've implemented hashed aggregation in CVS tip.  I have not had
> the time to try to benchmark it, but I'd be interested if anyone can
> run some tests on 7.4devel.  Eliminating the need for a SORT step
> should help aggregations over large datasets.
> 
> Note that even though there's no SORT, the sort_mem setting is used
> to determine the allowable hashtable size, so a too-small sort_mem
> might discourage the planner from selecting hashed aggregation.
> Use EXPLAIN to see which query plan gets chosen.
> 
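Per Tom's note, the sort_mem setting caps the hash table size the planner will consider, so it's worth checking that setting before comparing plans. A hypothetical psql exchange (the 8192 value is purely illustrative, not what I actually used):

```sql
-- sort_mem is expressed in KB; raising it enlarges the hash table the
-- planner is willing to use for hashed aggregation (illustrative value).
show sort_mem;
set sort_mem to 8192;
```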

Here are some tests on a reasonably sized (and real-life, as opposed to 
contrived) dataset:

parts=# set enable_hashagg to off;
SET
parts=# explain analyze select i.part_id, sum(w.qty_oh) as total_oh from inv i, iwhs w where i.part_id = w.part_id group by i.part_id;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=11111.93..11744.90 rows=35528 width=36) (actual time=2799.40..3140.17 rows=34575 loops=1)
   ->  Sort  (cost=11111.93..11293.31 rows=72553 width=36) (actual time=2799.35..2896.43 rows=72548 loops=1)
         Sort Key: i.part_id
         ->  Hash Join  (cost=1319.10..5254.45 rows=72553 width=36) (actual time=157.72..1231.01 rows=72548 loops=1)
               Hash Cond: ("outer".part_id = "inner".part_id)
               ->  Seq Scan on iwhs w  (cost=0.00..2121.53 rows=72553 width=22) (actual time=0.01..286.80 rows=72553 loops=1)
               ->  Hash  (cost=1230.28..1230.28 rows=35528 width=14) (actual time=157.50..157.50 rows=0 loops=1)
                     ->  Seq Scan on inv i  (cost=0.00..1230.28 rows=35528 width=14) (actual time=0.02..88.00 rows=35528 loops=1)
 Total runtime: 3168.73 msec
(9 rows)

parts=# set enable_hashagg to on;
SET
parts=# explain analyze select i.part_id, sum(w.qty_oh) as total_oh from inv i, iwhs w where i.part_id = w.part_id group by i.part_id;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=5617.22..5706.04 rows=35528 width=36) (actual time=1507.89..1608.32 rows=34575 loops=1)
   ->  Hash Join  (cost=1319.10..5254.45 rows=72553 width=36) (actual time=153.46..1231.34 rows=72548 loops=1)
         Hash Cond: ("outer".part_id = "inner".part_id)
         ->  Seq Scan on iwhs w  (cost=0.00..2121.53 rows=72553 width=22) (actual time=0.01..274.74 rows=72553 loops=1)
         ->  Hash  (cost=1230.28..1230.28 rows=35528 width=14) (actual time=153.21..153.21 rows=0 loops=1)
               ->  Seq Scan on inv i  (cost=0.00..1230.28 rows=35528 width=14) (actual time=0.03..84.67 rows=35528 loops=1)
 Total runtime: 1661.53 msec
(7 rows)

parts=# explain analyze select i.part_id, sum(w.qty_oh) as total_oh from inv i, iwhs w where i.part_id = w.part_id group by i.part_id having sum(w.qty_oh) > 0;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=11111.93..12015.10 rows=35528 width=36) (actual time=2823.65..3263.16 rows=4189 loops=1)
   Filter: (sum(qty_oh) > 0::double precision)
   ->  Sort  (cost=11111.93..11293.31 rows=72553 width=36) (actual time=2823.40..2926.07 rows=72548 loops=1)
         Sort Key: i.part_id
         ->  Hash Join  (cost=1319.10..5254.45 rows=72553 width=36) (actual time=156.39..1240.61 rows=72548 loops=1)
               Hash Cond: ("outer".part_id = "inner".part_id)
               ->  Seq Scan on iwhs w  (cost=0.00..2121.53 rows=72553 width=22) (actual time=0.01..290.47 rows=72553 loops=1)
               ->  Hash  (cost=1230.28..1230.28 rows=35528 width=14) (actual time=156.16..156.16 rows=0 loops=1)
                     ->  Seq Scan on inv i  (cost=0.00..1230.28 rows=35528 width=14) (actual time=0.02..86.95 rows=35528 loops=1)
 Total runtime: 3282.27 msec
(10 rows)


Note that, like Josh, I saw a nice improvement when using HashAggregate on 
the simpler case, but as soon as I added a HAVING clause the optimizer 
switched back to GroupAggregate.
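One follow-up I haven't tried yet (so treat this as a guess, not a result): since sort_mem caps the allowable hash table size, raising it before the HAVING query might tip the planner back to HashAggregate:

```sql
-- Untested hypothesis: a larger sort_mem (in KB; 16384 is illustrative)
-- may let the planner choose HashAggregate even with the HAVING clause.
set sort_mem to 16384;
explain analyze
select i.part_id, sum(w.qty_oh) as total_oh
  from inv i, iwhs w
 where i.part_id = w.part_id
 group by i.part_id
having sum(w.qty_oh) > 0;
```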

I'll try to play around with this a bit more later today.

Joe

