Re: 9.5: Memory-bounded HashAgg

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Memory-bounded HashAgg
Date: 2014-08-14 16:54:45
Message-ID: 1408035285.2335.163.camel@jeff-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2014-08-14 at 16:17 +0200, Tomas Vondra wrote:
> Either it belongs to the current batch (and either it's in the hash
> table, or you add it there), or it's not - in that case write it to a
> temp file.

I think the part you left out is that you need two files per batch: one
for the dumped-out partially-computed state values, and one for the

In other words, you haven't really discussed the step where you reunite
the tuples with that partially-computed state.

> For sure, it's not for free - it may write to quite a few files. Is it
> more expensive than what you propose? I'm not sure about that. With
> your batching scheme, you'll end up with lower number of large batches,
> and you'll need to read and split them, possibly repeatedly. The
> batching scheme from hashjoin minimizes this.

My approach only has fewer batches if it elects to have fewer batches,
which might happen for two reasons:
1. A cardinality misestimate. This certainly could happen, but we do
have useful numbers to work from (we know the number of tuples and
distincts that we've read so far), so it's far from a blind guess.
2. We're concerned about the random I/O from way too many partitions.

> I fail to see how this is different from your approach? How can you
> output any tuples before processing the whole inner relation?

Right, the only thing I avoid is scanning the hash table and dumping out
the groups.

This isn't a major distinction, more like "my approach does a little
less work before returning tuples", and I'm not even sure I can defend
that, so I'll retract this point.

> Your approach is to do multi-level batching, and I was thinking whether
> it'd be possible to use the same approach (single level). But in
> retrospect it probably does not make much sense, because the multi-level
> batching is one of the points of the proposed approach.

Now that I think about it, many of the points we discussed could
actually work with either approach:
* In my approach, if I need more partitions, I could create more in
much the same way as HashJoin to keep it single-level (as you suggest
* In your approach, if there are too many partitions, you could avoid
random I/O by intentionally putting tuples from multiple partitions in a
single file and moving them while reading.
* If given a way to write out the partially-computed states, I could
evict some groups from the hash table to keep an array_agg() bounded.

Our approaches only differ on one fundamental trade-off that I see:
(A) My approach requires a hash lookup of an already-computed hash for
every incoming tuple, not only the ones going into the hash table.
(B) Your approach requires scanning the hash table and dumping out the
states every time the hash table fills up, which therefore requires a
way to dump out the partial states.

You could probably win the argument by pointing out that (A) is O(N) and
(B) is O(log2(N)). But I suspect that cost (A) is very low.

Unfortunately, it would take some effort to test your approach because
we'd actually need a way to write out the partially-computed state, and
the algorithm itself seems a little more complex. So I'm not really sure
how to proceed.

Jeff Davis

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-08-14 16:55:49 Re: Function to know last log write timestamp
Previous Message Tom Lane 2014-08-14 16:53:37 Re: 9.5: Memory-bounded HashAgg