Re: Final Patch for GROUPING SETS

From: Andres Freund <andres(at)anarazel(dot)de>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Greg Stark <stark(at)mit(dot)edu>, Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Sabino Mullane <greg(at)turnstep(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Svenne Krap <svenne(at)krap(dot)dk>
Subject: Re: Final Patch for GROUPING SETS
Date: 2015-05-14 05:50:31
Message-ID: 20150514055031.GE9584@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2015-05-13 22:51:15 +0200, Andres Freund wrote:
> I'm pretty sure by now that I dislike the introduction of GroupedVar,
> and not just tentatively. While I can see why you found its
> introduction to be nicer than fiddling with the result tuple, for me the
> disadvantages seem to outweigh the advantage. For one it's rather wierd
> to have Var nodes be changed into GroupedVar in setrefs.c. The number
> of places that need to be touched even when it's a 'planned stmt only'
> type of node is still pretty large.
>
> Andrew: I'll work on changing this in a couple hours unless you're
> speaking up about doing it yourself.

I did a stab at removing it, and it imo definitely ends up looking
better.

The code for the GroupedVar replacement isn't perfect yet, but I think
it'd be possible to clean that up until Friday. Unfortunately, after
prolonged staring out of the window, I came to the conclusion that I
don't think the current tree structure isn't right.

I still believe that the general approach of chaining vs. a union or CTE
is correct due to the efficiency arguments upthread. My problem is
that, unless I very much misunderstand something, the current
implementation can end up requiring roughly #sets * #input of additional
space for the "sidechannel tuplestore" in some bad cases. That happens
if you group by a couple clauses that each lead to a high number of
groups.

That happens because the aggregated rows produced in the chain nodes
can't be returned up-tree, because the the next chain (or final group
aggregate) node will expect unaggregated tuples. The current solution
for that is to move the aggregated rows produced in chain nodes into a
tuplestore that's then drained when the top level aggregate node has
done it's own job.

While that's probably not too bad in many cases because most of the use
cases aggregation will be relatively effective, it does seem to be
further evidence that the sidechannel tuplestore isn't the perfect idea.

What I think we should/need to do instead is to the chaining locally
inside one aggregation node. That way the aggregated tuples can be
returned directly, without the sidechannel. While that will require
inlining part of the code from nodeSort.c it doesn't seem too bad.

Besides the advantage of getting rid of that tuplestore, it'll also fix
the explain performance problems (as there's no deep tree to traverse
via ruleutils.c), get rid of the the preemtive ExecReScan() to control
memory usage. I think it might also make combined hashing/sorting
easier.

A rough sketch of what I'm thinking of is:
ExecAgg()
{
...
while (!aggstate->consumed_input)
{
outerslot = ExecProcNode(outerPlanState(aggstate));

if (TupIsNull(outerslot))
{
consumed_input = true;
break;
}

if (aggstate->doing_hashing)
{
entry = lookup_hash_entry(aggstate, outerslot);

/* Advance the aggregates */
advance_aggregates(aggstate, entry->pergroup);
}

if (aggstate->presorted_input || AGG_PLAIN)
{
/* handle aggregation, return if done with group */
}

if (aggstate->doing_chaining)
{
tuplesort_puttupleslot(tuplesortstate, slot);
}
}

if (aggstate->doing_hashing && !done)
agg_retrieve_hashed();

/*
* Feed data from one sort to the next, to compute grouping sets that
* need differing sort orders.
*/
last_sort = tuplesortstate[0];
current_sort = numGroupingSets > 0 ? tuplesortstate[1] : NULL;

while (aggstate->doing_chaining && !done_sorting)
{
tuplesort_gettupleslot(last_sort, tmpslot);

/* exhausted all tuple from a particular sort order, move on */
if (TupIsNull(tmpslot))
{
finalize_aggregates(...);
tuplesort_end(last_sort); /* maybe save stats somewhere? */
last_sort = current_sort;
current_sort = tuplesortstate[...];
if (all_sorts_done)
done_sorting = true;

return aggregated;
}

if (current_sort != NULL)
tuplesort_puttupleslot(current_sort, slot);

/* check if we crossed a boundary */
if (!execTuplesMatch(...))
{
finalize_aggregates(...);
aggstate->grp_firstTuple = ...
return aggregated;
}

advance_aggregates();
tuplesort_puttupleslot(current_sort, slot);
}
}

I think this is quite doable and seems likely to actually end up with
easier to understand code. But unfortunately it seems to be big enough
of a change to make it unlikely to be done in sufficient quality until
the freeze. I'll nonetheless work a couple hours on it tomorrow.

Andrew, is that a structure you could live with, or not?

Others, what do you think?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Nasby 2015-05-14 06:01:11 Re: proposal: contrib module - generic command scheduler
Previous Message Noah Misch 2015-05-14 04:13:38 Re: Why does contain_leaked_vars believe MinMaxExpr is safe?