9.5: Memory-bounded HashAgg

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: 9.5: Memory-bounded HashAgg
Date: 2014-08-10 21:26:50
Message-ID: 1407706010.6623.16.camel@jeff-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

This patch is requires the Memory Accounting patch, or something similar
to track memory usage.

The attached patch enables hashagg to spill to disk, which means that
hashagg will contain itself to work_mem even if the planner makes a
bad misestimate of the cardinality.

This is a well-known concept; there's even a Berkeley homework
assignment floating around to implement it -- in postgres 7.2, no
less. I didn't take the exact same approach as the homework assignment
suggests, but it's not much different, either. My apologies if some
classes are still using this as a homework assignment, but postgres
needs to eventually have an answer to this problem.

Included is a GUC, "enable_hashagg_disk" (default on), which allows
the planner to choose hashagg even if it doesn't expect the hashtable
to fit in memory. If it's off, and the planner misestimates the
cardinality, hashagg will still use the disk to contain itself to

One situation that might surprise the user is if work_mem is set too
low, and the user is *relying* on a misestimate to pick hashagg. With
this patch, it would end up going to disk, which might be
significantly slower. The solution for the user is to increase

Rough Design:

Change the hash aggregate algorithm to accept a generic "work item",
which consists of an input file as well as some other bookkeeping

Initially prime the algorithm by adding a single work item where the
file is NULL, indicating that it should read from the outer plan.

If the memory is exhausted during execution of a work item, then
continue to allow existing groups to be aggregated, but do not allow new
groups to be created in the hash table. Tuples representing new groups
are saved in an output partition file referenced in the work item that
is currently being executed.

When the work item is done, emit any groups in the hash table, clear the
hash table, and turn each output partition file into a new work item.

Each time through at least some groups are able to stay in the hash
table, so eventually none will need to be saved in output partitions, no
new work items will be created, and the algorithm will terminate. This
is true even if the number of output partitions is always one.

Open items:
* costing
* EXPLAIN details for disk usage
* choose number of partitions intelligently
* performance testing

Initial tests indicate that it can be competitive with sort+groupagg
when the disk is involved, but more testing is required.

Feedback welcome.

Jeff Davis

Attachment Content-Type Size
hashagg-disk-20140810.patch text/x-patch 39.4 KB


Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2014-08-10 21:35:30 Re: Partitioning performance: cache stringToNode() of pg_constraint.ccbin
Previous Message Jeff Davis 2014-08-10 20:50:16 Re: 9.5: Better memory accounting, towards memory-bounded HashAgg