Re: 9.5: Memory-bounded HashAgg

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Memory-bounded HashAgg
Date: 2014-08-10 23:29:12
Lists: pgsql-hackers


It's 1 AM here, so just a few comments after quickly reading the patch.

On 10.8.2014 23:26, Jeff Davis wrote:
> This patch requires the Memory Accounting patch, or something
> similar, to track memory usage.

I think the patch you sent actually includes the accounting patch. Is
that on purpose, or by accident?

I'd suggest keeping these two patches separate.

> Rough Design:
> Change the hash aggregate algorithm to accept a generic "work item",
> which consists of an input file as well as some other bookkeeping
> information.
> Initially prime the algorithm by adding a single work item where the
> file is NULL, indicating that it should read from the outer plan.
> If the memory is exhausted during execution of a work item, then
> continue to allow existing groups to be aggregated, but do not allow
> new groups to be created in the hash table. Tuples representing new
> groups are saved in an output partition file referenced in the work
> item that is currently being executed.
> When the work item is done, emit any groups in the hash table, clear
> the hash table, and turn each output partition file into a new work
> item.
> Each time through at least some groups are able to stay in the hash
> table, so eventually none will need to be saved in output
> partitions, no new work items will be created, and the algorithm will
> terminate. This is true even if the number of output partitions is
> always one.

So once a group gets into memory, it stays there? That's going to work
fine for aggregates with fixed-size state (int4, or generally state that
gets allocated once and does not grow), but I'm afraid that for
aggregates with growing state (array_agg and similar) that's not really
a solution.

How difficult would it be to dump the current aggregate states into a
file (and remove them from the hash table)?

While hacking on the hash join, I envisioned the hash aggregate might
work in a very similar manner, i.e. something like this:

* nbatches=1, nbits=0
* when work_mem gets full => nbatches *= 2, nbits += 1
* get rid of half the groups, using nbits from the hash
=> dump the current states into 'states.batchno' file
=> dump further tuples to 'tuples.batchno' file
* continue until the end, or until work_mem gets full again

This is pretty much what the hashjoin does, except that the join needs
to batch the outer relation too (which hashagg does not need to do).
Otherwise most of the batching logic can be copied.

It also seems to me that the logic of the patch is roughly this:

* try to look up the group in the hash table
* found => call the transition function
* not found
  * enough space => create the group, call the transition function
  * not enough space => the tuple goes to a batch file

Which pretty much means every tuple needs to do the lookup first. The
nice thing about the hash-join approach is that you don't really need
to do the lookup - you just peek at the hash bits to see whether the
group belongs to the current batch (and if not, which batch it should
go to).

Of course, that would require the ability to dump the current state of
a group, but for aggregates using fixed-length basic types as state
(int4/int8, ...) that's trivial.

For aggregates using 'internal' to pass pointers, that requires some
help from the aggregate author - serialization/deserialization functions.

> Open items:
> * costing

Not sure how this is done for the hash-join, but I guess that might be a
good place for inspiration.

> * EXPLAIN details for disk usage
> * choose number of partitions intelligently

What is the purpose of HASH_DISK_MAX_PARTITIONS? I mean, when we decide
we need 2048 partitions, why should we use fewer if we believe that
will push us over work_mem?

> * performance testing
> Initial tests indicate that it can be competitive with sort+groupagg
> when the disk is involved, but more testing is required.

For us, removing the sort is a big deal, because we're working with
>100M rows regularly. It's more complicated though, because the sort is
usually enforced by COUNT(DISTINCT) and that's not going to disappear
because of this patch. But that's solvable with a custom aggregate.

