Re: Spilling hashed SetOps and aggregates to disk

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: David Gershuni <dgershun(at)cs(dot)cmu(dot)edu>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spilling hashed SetOps and aggregates to disk
Date: 2018-06-20 05:36:43
Message-ID: 1529473003.3820.44.camel@j-davis.com
Lists: pgsql-hackers

On Fri, 2018-06-15 at 12:30 -0700, David Gershuni wrote:
> For example, imagine that a tuple belonging to a group G in grouping
> set A had to be spilled because it also belonged to an evicted group
> from grouping set B. Then group G would remain in set A’s hash table
> at the end of the phase, but it wouldn’t have aggregated the values
> from the spilled tuple. Of course, work-arounds are possible, but the
> current approach would not work as is.

I was imagining that (in the simplest approach) each hash table would
have its own set of spilled tuples. If grouping set A can be advanced
(because the tuple's group is present in A's hash table) but B cannot
(because the group is not in B's hash table and we are at the memory
limit), then the tuple goes into the spill file for B but not A. That
means A is still finished after the first pass.
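
To make sure we're talking about the same thing, here's a rough
standalone sketch of the per-set spill idea (this is not nodeAgg.c
code, and every name below is made up): each grouping set keeps its
own in-memory groups and its own spill list, and a tuple is spilled
for a set only when that particular set can't advance it in memory:

#include <stdio.h>
#include <string.h>

#define NSETS       2       /* grouping sets "A" and "B" */
#define MAX_GROUPS  4       /* stand-in for the per-set memory limit */
#define MAX_SPILL   64

typedef struct SetState
{
    int     keys[MAX_GROUPS];       /* in-memory group keys */
    long    counts[MAX_GROUPS];     /* in-memory transition states */
    int     ngroups;
    int     spill[MAX_SPILL];       /* this set's private spill "file" */
    int     nspilled;
} SetState;

static void
advance_or_spill(SetState *set, int key)
{
    for (int i = 0; i < set->ngroups; i++)
    {
        if (set->keys[i] == key)
        {
            set->counts[i]++;       /* group already in memory: just advance it */
            return;
        }
    }

    if (set->ngroups < MAX_GROUPS)
    {
        /* room left under this set's "memory limit": create the group */
        set->keys[set->ngroups] = key;
        set->counts[set->ngroups] = 1;
        set->ngroups++;
    }
    else
    {
        /* no room: spill the tuple for this set only */
        set->spill[set->nspilled++] = key;
    }
}

int
main(void)
{
    SetState    sets[NSETS];
    int         tuples[] = {1, 2, 3, 4, 5, 6, 1, 2};

    memset(sets, 0, sizeof(sets));

    for (unsigned i = 0; i < sizeof(tuples) / sizeof(tuples[0]); i++)
    {
        advance_or_spill(&sets[0], tuples[i] % 2);  /* set A: few groups, fits */
        advance_or_spill(&sets[1], tuples[i]);      /* set B: many groups, spills */
    }

    for (int s = 0; s < NSETS; s++)
        printf("set %c: %d groups in memory, %d tuples spilled\n",
               'A' + s, sets[s].ngroups, sets[s].nspilled);
    return 0;
}

In this toy run, set A never spills and is complete after the first
pass, while set B has to reprocess its two spilled tuples in a later
pass.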

But I am worried that I am missing something, because it appears that
for AGG_MIXED, we wait until the end of the last phase to emit the
contents of the hash tables. Wouldn't they be complete after the first
phase?

I realize that this simple approach could be inefficient if multiple
hash tables are spilling, because we'd duplicate the spilled tuples
across their spill files. But let me know if it won't work at all.

> But as you mentioned, this solution relies on all grouping keys
> being sortable, so we would need a fallback plan. For this, we could
> use your hash-based approach, but we would have to make
> modifications. One idea would be to split each grouping set into its
> own aggregate phase, so that only one hash table is in memory at a
> time. This would eliminate the performance benefits of grouping sets
> when keys aren’t sortable, but at least they would still work.

I am having a hard time trying to satisfy all of the constraints that
have been raised so far:

* handle data types with hash ops but no btree ops
* handle many different group size distributions and input orders
efficiently
* avoid regressions or inefficiencies for grouping sets
* handle variable-size (particularly O(n)-space) transition states

without taking on a lot of complexity. I like to work toward the right
design, but I think simplicity also factors into the right design.
nodeAgg.c is already one of the larger and more complicated files that
we have.

A lot of the complexity we already have comes from grouping sets being
performed in one giant executor node rather than exposing the
individual phases as separate executor nodes. I suppose the reason
(correct me if I'm wrong) is that a phase has two outputs: the
aggregated groups, and the original input tuples in a new sort order.
That seems solvable -- the executor already passes non-tuple results
between nodes: bitmaps are passed around until BitmapHeapScan turns
them into tuples.
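
To make the two-output problem concrete, here is a purely hypothetical
sketch (no node or struct like this exists today) of what a split-out
phase node would have to hand to the next phase, by analogy with the
bitmap that BitmapIndexScan hands to BitmapHeapScan:

/* Hypothetical, for illustration only. */
struct Tuplestorestate;             /* holds the aggregated result rows */
struct Tuplesortstate;              /* holds the input re-sorted for the next phase */

typedef struct GroupingPhaseResult
{
    struct Tuplestorestate *groups;     /* output 1: finished groups */
    struct Tuplesortstate  *resorted;   /* output 2: input in the new sort order */
} GroupingPhaseResult;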

Regards,
Jeff Davis
