From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Memory-bounded HashAgg
Date: 2014-12-11 20:22:31
Message-ID: 5489FD07.2080708@fuzzy.cz
Lists: pgsql-hackers

On 11.12.2014 11:46, Jeff Davis wrote:
>
> New patch attached. All open items are complete, though the patch may
> have a few rough edges.
>
> Summary of changes:
>
> * rebased on top of latest memory accounting patch
> http://www.postgresql.org/message-id/1417497257.5584.5.camel@jeff-desktop
> * added a flag to hash_create to prevent it from creating an extra
> level of memory context
> - without this, the memory accounting would have a measurable impact
> on performance
> * cost model for the disk usage
> * intelligently choose the number of partitions for each pass of the
> data
> * explain support
> * in build_hash_table(), be more intelligent about the value of
> nbuckets to pass to BuildTupleHashTable()
> - BuildTupleHashTable tries to choose a value to keep the table in
> work_mem, but it isn't very accurate.
> * some very rudimentary testing (sanity checks, really) shows good
> results

I plan to look into this over the holidays, hopefully.

> Summary of previous discussion (my summary; I may have missed some
> points):
>
> Tom Lane requested that the patch also handle the case where transition
> values grow (e.g. array_agg) beyond work_mem. I feel this patch provides
> a lot of benefit as it is, and trying to handle that case would be a lot
> more work (we need a way to write the transition values out to disk at a
> minimum, and perhaps combine them with other transition values). I also
> don't think my patch would interfere with a fix there in the future.
>
> Tomas Vondra suggested an alternative design that more closely resembles
> HashJoin: instead of filling up the hash table and then spilling any new
> groups, the idea would be to split the current data into two partitions,
> keep one in the hash table, and spill the other (see
> ExecHashIncreaseNumBatches()). This has the advantage that it's very
> fast to identify whether the tuple is part of the in-memory batch or
> not; and we can avoid even looking in the memory hashtable if not.
>
> The batch-splitting approach has a major downside, however: you are
> likely to evict a skew value from the in-memory batch, which will result
> in all subsequent tuples with that skew value going to disk. My approach
> never evicts from the in-memory table until we actually finalize the
> groups, so the skew values are likely to be completely processed in the
> first pass.

I don't think that's the main issue - there are probably ways to work
around that (e.g. by keeping a "skew hash table" for those frequent
values, much like hash join does).

The main problem IMHO is that it requires writing the transition values
to disk, which we don't know how to do in many cases (esp. in the
interesting ones, where the transition values grow).

> So, the attached patch implements my original approach, which I still
> feel is the best solution.

I think this is a reasonable approach - it's true it does not handle the
case with growing aggregate state (e.g. array_agg), so it really fixes
"just" the case where we underestimate the number of groups.

But I believe we need this approach anyway, because we'll never know how
to write out all the various transition values (e.g. think of custom
aggregates), and this is an improvement.

We can build on this and add the more elaborate hashjoin-like approach
in the future.

regards
Tomas
