Re: 9.5: Memory-bounded HashAgg

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Memory-bounded HashAgg
Date: 2014-12-11 10:46:47
Message-ID: 1418294807.5584.32.camel@jeff-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2014-08-10 at 14:26 -0700, Jeff Davis wrote:
> This patch is requires the Memory Accounting patch, or something similar
> to track memory usage.
> The attached patch enables hashagg to spill to disk, which means that
> hashagg will contain itself to work_mem even if the planner makes a
> bad misestimate of the cardinality.

New patch attached. All open items are complete, though the patch may
have a few rough edges.

Summary of changes:

* rebased on top of latest memory accounting patch
* added a flag to hash_create to prevent it from creating an extra
level of memory context
- without this, the memory accounting would have a measurable impact
on performance
* cost model for the disk usage
* intelligently choose the number of partitions for each pass of the
* explain support
* in build_hash_table(), be more intelligent about the value of
nbuckets to pass to BuildTupleHashTable()
- BuildTupleHashTable tries to choose a value to keep the table in
work_mem, but it isn't very accurate.
* some very rudimentary testing (sanity checks, really) shows good

Summary of previous discussion (my summary; I may have missed some

Tom Lane requested that the patch also handle the case where transition
values grow (e.g. array_agg) beyond work_mem. I feel this patch provides
a lot of benefit as it is, and trying to handle that case would be a lot
more work (we need a way to write the transition values out to disk at a
minimum, and perhaps combine them with other transition values). I also
don't think my patch would interfere with a fix there in the future.

Tomas Vondra suggested an alternative design that more closely resembles
HashJoin: instead of filling up the hash table and then spilling any new
groups, the idea would be to split the current data into two partitions,
keep one in the hash table, and spill the other (see
ExecHashIncreaseNumBatches()). This has the advantage that it's very
fast to identify whether the tuple is part of the in-memory batch or
not; and we can avoid even looking in the memory hashtable if not.

The batch-splitting approach has a major downside, however: you are
likely to evict a skew value from the in-memory batch, which will result
in all subsequent tuples with that skew value going to disk. My approach
never evicts from the in-memory table until we actually finalize the
groups, so the skew values are likely to be completely processed in the
first pass.

So, the attached patch implements my original approach, which I still
feel is the best solution.

Jeff Davis

Attachment Content-Type Size
hashagg-disk-20141211.patch text/x-patch 44.9 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2014-12-11 11:01:08 Re: Directory/File Access Permissions for COPY and Generic File Access Functions
Previous Message Etsuro Fujita 2014-12-11 10:30:01 Re: inherit support for foreign tables