From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: David Gershuni <dgershun(at)cs(dot)cmu(dot)edu>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spilling hashed SetOps and aggregates to disk
Date: 2018-06-20 05:36:43
Message-ID: 1529473003.3820.44.camel@j-davis.com
Lists: pgsql-hackers

On Fri, 2018-06-15 at 12:30 -0700, David Gershuni wrote:
> For example, imagine a tuple that belongs to a group G in grouping set A
> had to be spilled because it also belonged to an evicted group from
> grouping set B. Then group G would remain in set A’s hash table at the
> end of the phase, but it wouldn’t have aggregated the values from the
> spilled tuple. Of course, work-arounds are possible, but the current
> approach would not work as is.

I was imagining that (in the simplest approach) each hash table would
have its own set of spilled tuples. If grouping set A can be advanced
(because it's present in the hash table) but B cannot (because it's not
in the hash table and we are at the memory limit), then it goes into
the spill file for B but not A. That means that A is still finished
after the first pass.
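In toy Python form, the scheme I have in mind looks roughly like this. (All names here are hypothetical illustrations, not nodeAgg.c code; each grouping set owns a hash table with a per-table group limit standing in for the memory limit, plus its own spill file.)

```python
# Toy model of per-grouping-set spilling: a tuple advances every set
# whose group is (or can be made) resident, and is written only to the
# spill files of the sets that are over their limit.

MAX_GROUPS = 2  # stand-in for the per-hash-table memory limit

def first_pass(tuples, grouping_sets):
    """Return ({set_name: {key: count}}, {set_name: [spilled tuples]}).

    grouping_sets maps a set name to the tuple of key columns it groups by.
    The transition state here is just a count, for simplicity.
    """
    tables = {name: {} for name in grouping_sets}
    spills = {name: [] for name in grouping_sets}
    for tup in tuples:
        for name, keycols in grouping_sets.items():
            key = tuple(tup[c] for c in keycols)
            table = tables[name]
            if key in table:
                table[key] += 1            # group resident: advance state
            elif len(table) < MAX_GROUPS:
                table[key] = 1             # room left: add a new group
            else:
                spills[name].append(tup)   # spill for this set only
    return tables, spills
```

With this shape, a set whose table never overflows is complete after the first pass even if some other set spilled the same tuple; only the overflowing sets need later passes over their own spill files.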
But I am worried that I am missing something, because it appears that
for AGG_MIXED, we wait until the end of the last phase to emit the
contents of the hash tables. Wouldn't they be complete after the first
phase?
I realize that this simple approach could be inefficient if multiple
hash tables are spilling because we'd duplicate the spilled tuples. But
let me know if it won't work at all.

> But as you mentioned, this solution relies on all grouping keys being
> sortable, so we would need a fallback plan. For this, we could use your
> hash-based approach, but we would have to make modifications. One idea
> would be to split each grouping set into its own aggregate phase, so
> that only one hash table is in memory at a time. This would eliminate
> the performance benefits of grouping sets when keys aren’t sortable,
> but at least they would still work.

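For reference, the fallback you describe could be sketched like this (again hypothetical names, and a count standing in for the transition state; the point is one hash table at a time, one rescan per grouping set):

```python
# Toy sketch of the hash-only fallback: each grouping set gets its own
# aggregate phase, so only a single hash table lives in memory at once.
# The input is rescanned once per set, which gives up the shared-scan
# benefit of grouping sets but still works when keys are hashable but
# not sortable.

def aggregate_per_phase(scan, grouping_sets):
    """scan: a zero-argument callable returning a fresh iterator of tuples."""
    results = {}
    for name, keycols in grouping_sets.items():
        table = {}                      # the single in-memory hash table
        for tup in scan():              # one full rescan per grouping set
            key = tuple(tup[c] for c in keycols)
            table[key] = table.get(key, 0) + 1
        results[name] = table           # emit this set, then drop the table
    return results
```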
I am having a hard time trying to satisfy, without taking on a lot of
complexity, all of the constraints that have been raised so far:

* handle data types with hash ops but no btree ops
* handle many different group-size distributions and input orders
  efficiently
* avoid regressions or inefficiencies for grouping sets
* handle variable-size (particularly O(n)-space) transition states

I like to work toward the right design, but I think simplicity also
factors into the right design. nodeAgg.c is already one of the larger
and more complicated files that we have.
A lot of the complexity we already have is that grouping sets are
performed in one giant executor node rather than exposing the
individual phases as separate executor nodes. I suppose the reason
(correct me if I'm wrong) is that there are two outputs from a phase:
the aggregated groups, and the original input tuples in a new sort
order. That seems solvable -- the executor passes bitmaps around until
BitmapHeapScan turns them into tuples.
Regards,
Jeff Davis