Re: Spilling hashed SetOps and aggregates to disk

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spilling hashed SetOps and aggregates to disk
Date: 2018-06-13 14:50:42
Message-ID: CAGTBQpa__-NP7=kKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 5, 2018 at 5:05 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On 06/05/2018 07:46 AM, Jeff Davis wrote:
> > On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
> >> I expect the eviction strategy to be the primary design challenge of
> >> this patch. The other bits will be mostly determined by this one
> >> piece.
> >
> > Not sure I agree that this is the primary challenge.
> >
> > The cases that benefit from eviction are probably a minority. I see two
> > categories that would benefit:
> >
> > * Natural clustering in the heap. This sounds fairly common, but a
> > lot of the cases that come to mind are too low-cardinality to be
> > compelling; e.g. timestamps grouped by hour/day/month. If someone has
> > run into a high-cardinality natural grouping case, let me know.
> > * ARRAY_AGG (or similar): individual state values can be large enough
> > that we need to evict to avoid exceeding work_mem even if not adding
> > any new groups.
> >
> > In either case, it seems like a fairly simple eviction strategy would
> > work. For instance, we could just evict the entire hash table if
> > work_mem is exceeded or if the hit rate on the hash table falls below a
> > certain threshold. If there was really something important that should
> > have stayed in the hash table, it will go back in soon anyway.
> >
> > So why should eviction be a major driver for the entire design? I agree
> > it should be an area of improvement for the future, so let me know if
> > you see a major problem, but I haven't been as focused on eviction.
> >
>
> My concern is more about what happens when the input tuple ordering is
> inherently incompatible with the eviction strategy, greatly increasing
> the amount of data written to disk during evictions.
>
> Say for example that we can fit 1000 groups into work_mem, and that
> there are 2000 groups in the input data set. If the input is correlated
> with the groups, everything is peachy because we'll evict the first
> batch, and then group the remaining 1000 groups (or something like
> that). But if the input data is random (which can easily happen, e.g.
> for IP addresses, UUIDs and such) we'll hit the limit repeatedly and
> will evict much sooner.
>
> I know you suggested simply dumping the whole hash table and starting
> from scratch while we talked about this at pgcon, but ISTM it has
> exactly this issue.
>
> But I don't know if there actually is a better option - maybe we simply
> have to accept this problem. After all, running slowly is still better
> than OOM (which may or may not happen now).
>
> I wonder if we can somehow detect this at run-time and maybe fall-back
> to groupagg. E.g. we could compare number of groups vs. number of input
> tuples when we first hit the limit. It's a rough heuristics, but maybe
> sufficient.

I've been applying a strategy like that to do massive streaming
aggregation quite successfully.

The code I have is in Python and lives in a private repo. There has
been talk both of open-sourcing it and of porting it into Postgres as
a kind of aggregate node, because it sounds like a good idea and seems
very relevant to this thread.

In essence, the technique I've been using always works with a
fixed-size hash table to do partial grouping. The table is never
grown; when a collision happens, the existing entry in the hash table
is flushed to disk and the aggregate state in that bucket is reset for
the incoming key. To avoid excessive spilling due to frequent
collisions, we use a kind of "lazy cuckoo" hash table. Lazy in the
sense that it does no relocations: it just checks two hash values,
and if it has to evict, it evicts the "oldest" value (by some cheap
definition of "old").
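
To make that concrete, here is a rough, untested Python sketch of that
kind of table. It is not the actual code; the class name, the combine
callbacks and the tab-separated spill format are made up purely for
illustration:

    import hashlib

    class PartialAgg:
        """Fixed-size, two-choice hash table that evicts to a spill file."""

        def __init__(self, nslots, spill_file, combine, init_state):
            self.nslots = nslots
            self.slots = [None] * nslots   # each slot: [key, state, last_used]
            self.spill = spill_file        # anything with a write() method
            self.combine = combine         # folds a value into a state
            self.init_state = init_state   # builds an empty state
            self.clock = 0                 # cheap notion of "age"

        def _candidates(self, key):
            # Two independent hash values -> two candidate slots.
            h = hashlib.blake2b(repr(key).encode(), digest_size=16).digest()
            return (int.from_bytes(h[:8], 'little') % self.nslots,
                    int.from_bytes(h[8:], 'little') % self.nslots)

        def add(self, key, value):
            self.clock += 1
            s1, s2 = self._candidates(key)
            for s in (s1, s2):
                entry = self.slots[s]
                if entry is not None and entry[0] == key:
                    entry[1] = self.combine(entry[1], value)  # accumulate
                    entry[2] = self.clock
                    return
            # Key not present: take an empty slot if there is one, otherwise
            # evict the "oldest" of the two candidates (no relocation).
            target = s1 if self.slots[s1] is None else s2
            if self.slots[target] is not None:
                target = min((s1, s2), key=lambda s: self.slots[s][2])
                old_key, old_state, _ = self.slots[target]
                self.spill.write(f"{old_key}\t{old_state}\n")  # flush to disk
            self.slots[target] = [key, self.combine(self.init_state(), value),
                                  self.clock]

        def flush(self):
            # Spill whatever is left, so the spill file holds all partial states.
            for entry in self.slots:
                if entry is not None:
                    self.spill.write(f"{entry[0]}\t{entry[1]}\n")
            self.slots = [None] * self.nslots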

The technique works very well to reduce temporary data size with a
fixed amount of working memory. The resulting spill file can then be
processed again to finalize the computation.

What I was pondering PG could do is feed the spilled tuples to a sort
node, using the partial hash aggregation only as a data-reducing step:

scan -> partial hash agg -> sort -> final group agg

The final group agg would have to know to consume and combine partial
aggregate states instead of producing them from scratch, but in
essence it seems like a relatively efficient process.
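
As a rough sketch of that finalize step (merge_states() and finalize()
are assumed per-aggregate helpers, not existing APIs):

    import itertools

    def final_group_agg(spilled_rows, merge_states, finalize):
        # spilled_rows: iterable of (key, partial_state) pairs produced by
        # the partial hash agg; the sorted() call plays the role of the
        # sort node.
        ordered = sorted(spilled_rows, key=lambda kv: kv[0])
        for key, group in itertools.groupby(ordered, key=lambda kv: kv[0]):
            state = None
            for _, partial in group:
                state = partial if state is None else merge_states(state, partial)
            yield key, finalize(state)

For a count(*), for instance, merge_states would simply add the
partial counts and finalize would be the identity.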

An adaptive hash agg node would start as a hash agg, and turn into a
"partial hash agg + sort + final group agg" when OOM is detected.

The benefit over ordinary sort+group agg is that the sort is happening
on a potentially much smaller data set. When the buffer is large
enough to capture enough key repetitions, the output of the partial
hash agg can be orders of magnitude smaller than the scan.

My proposal was to use this for all group aggs, not only when the
planner chooses a hash agg, because it can speed up the sort and
reduce temp storage considerably. I guess the trick lies in estimating
that cardinality reduction to avoid applying this when there's no hope
of getting a payoff. The overhead of such a lazy hash table isn't
much, really, but its cache locality is terrible, so that does need to
be taken into account.

Handling of memory pressure due to large aggregate states can be
accomplished post-accumulation. Assuming there's a way to measure
aggregate state size, a running sum of state sizes can be maintained,
and if it exceeds the memory budget, the biggest entry for the current
group can be spilled (each key can end up in one of two slots, so
after aggregating we can spill that key, or its sister key, the one
sitting in the other slot). There's an overhead to that accounting,
though, so it should be avoided for fixed-size aggregate states.
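
A sketch of that accounting, assuming some state_size() measure of a
single aggregate state (and using a plain dict instead of the two-slot
table, for brevity):

    def enforce_state_budget(table, budget, state_size, spill):
        # table: mapping of key -> aggregate state
        total = sum(state_size(s) for s in table.values())
        while total > budget and table:
            big_key = max(table, key=lambda k: state_size(table[k]))
            total -= state_size(table[big_key])
            # Spill the biggest partial state; it gets re-combined later.
            spill.write(f"{big_key}\t{table.pop(big_key)}\n")
        return total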
