On Mon, Mar 26, 2012 at 6:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Greg Stark <stark(at)mit(dot)edu> writes:
>> I have a sketch for how to handle spilling hash aggregates to disk in
>> my head. I'm not sure if it's worth the amount of complexity it would
>> require but I'll poke around a bit and see if it works out well.
> It'd be awfully nice if those could spill to disk. I think that
> currently that's the only plan type where a misestimate can lead to
> hard failure rather than just slower-than-you'd-like. Which is not
> nice considering that the estimates are necessarily just estimates.
> Could you give us a brain dump on the sketch? I've never seen how to
> do it without unreasonable overhead.
Hm. So my original plan was dependent on adding the state-merge
function we've talked about in the past. Not all aggregate functions
necessarily can support such a function but I think all or nearly all
the builtin aggregates can. Certainly min,max, count, sum, avg,
stddev, array_agg can which are most of what people do. That would be
a function which can take two state variables and produce a new state
If you have this then you can sort and spill the hash table and start
a new hash table and keep going. When you're done you merge the hash
tables using something like heap merge but applying the state merge
However now that I've started thinking about it further I think you
could solve it with less complexity by cheating in various ways. For
example if you limit the hash size to 1/2 of work_mem then you when
you reach that limit you could just stuff any tuple that doesn't match
a hash entry into a tuplesort with 1/2 of work_mem and do the regular
level break logic on the output of that.
Or if you can count on tuplesort to be a stable sort (and I think it
might be already) then you could just spill the sorted hash entries
and then switch to doing a tapesort for the rest. When you finish
merging you can read them back and do level break logic like normal.
If there is a partially computed state it will be the first one in the
equal group. If you can't count on tuplesort to be stable you could
dump the partially computed states to a different tape and do a merge
between it and the sorted output of the main data set.
The holy grail of this kind of merging of the two algorithms would be
something that keeps the hash table going and maintains an lru of hash
entries. When it grows too large it would spill the oldest partial
state. Then it would sort those states and merge them (possibly in a
single step). That might be too complex to pull its weight given some
of the above sketches are probably simple enough.
In response to
pgsql-hackers by date
|Next:||From: Robert Haas||Date: 2012-03-26 23:53:25|
|Subject: Re: Cross-backend signals and administration (Was: Re:
pg_terminate_backend for same-role)|
|Previous:||From: Tom Lane||Date: 2012-03-26 23:39:30|
|Subject: Re: 9.2 commitfest closure (was Command Triggers, v16)|