Re: Final Patch for GROUPING SETS

From: Andres Freund <andres(at)anarazel(dot)de>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Greg Stark <stark(at)mit(dot)edu>, Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Sabino Mullane <greg(at)turnstep(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Svenne Krap <svenne(at)krap(dot)dk>
Subject: Re: Final Patch for GROUPING SETS
Date: 2015-04-30 00:07:30
Message-ID: 20150430000730.GB16366@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

This is not a real review. I'm just scanning through the patch, without
reading the thread, to see whether anything strikes me as "worthy" of
controversy. While scanning I may have a couple of observations or
questions.

On 2015-03-13 15:46:15 +0000, Andrew Gierth wrote:

> + * A list of grouping sets which is structurally equivalent to a ROLLUP
> + * clause (e.g. (a,b,c), (a,b), (a)) can be processed in a single pass over
> + * ordered data. We do this by keeping a separate set of transition values
> + * for each grouping set being concurrently processed; for each input tuple
> + * we update them all, and on group boundaries we reset some initial subset
> + * of the states (the list of grouping sets is ordered from most specific to
> + * least specific). One AGG_SORTED node thus handles any number of grouping
> + * sets as long as they share a sort order.

I found "initial subset" not very clear, even though I probably guessed
the right meaning.

> + * To handle multiple grouping sets that _don't_ share a sort order, we use
> + * a different strategy. An AGG_CHAINED node receives rows in sorted order
> + * and returns them unchanged, but computes transition values for its own
> + * list of grouping sets. At group boundaries, rather than returning the
> + * aggregated row (which is incompatible with the input rows), it writes it
> + * to a side-channel in the form of a tuplestore. Thus, a number of
> + * AGG_CHAINED nodes are associated with a single AGG_SORTED node (the
> + * "chain head"), which creates the side channel and, when it has returned
> + * all of its own data, returns the tuples from the tuplestore to its own
> + * caller.

This paragraph deserves to be expanded imo.

> + * In order to avoid excess memory consumption from a chain of alternating
> + * Sort and AGG_CHAINED nodes, we reset each child Sort node preemptively,
> + * allowing us to cap the memory usage for all the sorts in the chain at
> + * twice the usage for a single node.

What does resetting 'preemptively' mean?

> + * From the perspective of aggregate transition and final functions, the
> + * only issue regarding grouping sets is this: a single call site (flinfo)
> + * of an aggregate function may be used for updating several different
> + * transition values in turn. So the function must not cache in the flinfo
> + * anything which logically belongs as part of the transition value (most
> + * importantly, the memory context in which the transition value exists).
> + * The support API functions (AggCheckCallContext, AggRegisterCallback) are
> + * sensitive to the grouping set for which the aggregate function is
> + * currently being called.

Hm. I've seen a bunch of aggregates do this.

> + * TODO: AGG_HASHED doesn't support multiple grouping sets yet.

Are you intending to resolve this before an eventual commit? Possibly
after the 'structural' issues are resolved? Or do you think this can
safely be put off for another release?

> @@ -534,11 +603,13 @@ static void
> advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
> {
> int aggno;
> + int setno = 0;
> + int numGroupingSets = Max(aggstate->numsets, 1);
> + int numAggs = aggstate->numaggs;
>
> - for (aggno = 0; aggno < aggstate->numaggs; aggno++)
> + for (aggno = 0; aggno < numAggs; aggno++)
> {
> AggStatePerAgg peraggstate = &aggstate->peragg[aggno];
> - AggStatePerGroup pergroupstate = &pergroup[aggno];
> ExprState *filter = peraggstate->aggrefstate->aggfilter;
> int numTransInputs = peraggstate->numTransInputs;
> int i;
> @@ -582,13 +653,16 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
> continue;
> }
>
> - /* OK, put the tuple into the tuplesort object */
> - if (peraggstate->numInputs == 1)
> - tuplesort_putdatum(peraggstate->sortstate,
> - slot->tts_values[0],
> - slot->tts_isnull[0]);
> - else
> - tuplesort_puttupleslot(peraggstate->sortstate, slot);
> + for (setno = 0; setno < numGroupingSets; setno++)
> + {
> + /* OK, put the tuple into the tuplesort object */
> + if (peraggstate->numInputs == 1)
> + tuplesort_putdatum(peraggstate->sortstates[setno],
> + slot->tts_values[0],
> + slot->tts_isnull[0]);
> + else
> + tuplesort_puttupleslot(peraggstate->sortstates[setno], slot);
> + }
> }

Hm. So a normal GROUP BY is just a subcase of grouping sets. Seems to
make sense, but worthwhile to mention somewhere in the intro.

> + if (!node->agg_done)
> + {
> + /* Dispatch based on strategy */
> + switch (((Agg *) node->ss.ps.plan)->aggstrategy)
> + {
> + case AGG_HASHED:
> + if (!node->table_filled)
> + agg_fill_hash_table(node);
> + result = agg_retrieve_hash_table(node);
> + break;
> + case AGG_CHAINED:
> + result = agg_retrieve_chained(node);
> + break;
> + default:
> + result = agg_retrieve_direct(node);
> + break;
> + }
> +
> + if (!TupIsNull(result))
> + return result;
> + }

Maybe it's just me, but I get twitchy if I see a default being used like
this. I'd much, much rather see the two remaining AGG_* types and get a
warning from the compiler if a new one is added.
> + /*-
> + * If a subgroup for the current grouping set is present, project it.
> + *
> + * We have a new group if:
> + * - we're out of input but haven't projected all grouping sets
> + * (checked above)
> + * OR
> + * - we already projected a row that wasn't from the last grouping
> + * set
> + * AND
> + * - the next grouping set has at least one grouping column (since
> + * empty grouping sets project only once input is exhausted)
> + * AND
> + * - the previous and pending rows differ on the grouping columns
> + * of the next grouping set
> + */
> + if (aggstate->input_done
> + || (node->aggstrategy == AGG_SORTED
> + && aggstate->projected_set != -1
> + && aggstate->projected_set < (numGroupingSets - 1)
> + && nextSetSize > 0
> + && !execTuplesMatch(econtext->ecxt_outertuple,
> + tmpcontext->ecxt_outertuple,
> + nextSetSize,
> + node->grpColIdx,
> + aggstate->eqfunctions,
> + tmpcontext->ecxt_per_tuple_memory)))

I'll bet this will look absolutely horrid after a pgindent run :/

> +/*
> + * We want to produce the absolute minimum possible number of lists here to
> + * avoid excess sorts. Fortunately, there is an algorithm for this; the problem
> + * of finding the minimal partition of a poset into chains (which is what we
> + * need, taking the list of grouping sets as a poset ordered by set inclusion)
> + * can be mapped to the problem of finding the maximum cardinality matching on
> + * a bipartite graph, which is solvable in polynomial time with a worst case of
> + * no worse than O(n^2.5) and usually much better. Since our N is at most 4096,
> + * we don't need to consider fallbacks to heuristic or approximate methods.
> + * (Planning time for a 12-d cube is under half a second on my modest system
> + * even with optimization off and assertions on.)

I think spelling out 'poset' (partially ordered set) once would be a
good thing.

> + * We use the Hopcroft-Karp algorithm for the graph matching; it seems to work
> + * well enough for our purposes. This implementation is based on pseudocode
> + * found at:
> + *
> + * http://en.wikipedia.org/w/index.php?title=Hopcroft%E2%80%93Karp_algorithm&oldid=593898016
> + *
> + * This implementation uses the same indices for elements of U and V (the two
> + * halves of the graph) because in our case they are always the same size, and
> + * we always know whether an index represents a u or a v. Index 0 is reserved
> + * for the NIL node.
> + */
> +
> +struct hk_state
> +{
> + int graph_size; /* size of half the graph plus NIL node */
> + int matching;
> + short **adjacency; /* adjacency[u] = [n, v1,v2,v3,...,vn] */
> + short *pair_uv; /* pair_uv[u] -> v */
> + short *pair_vu; /* pair_vu[v] -> u */
> + float *distance; /* distance[u], float so we can have +inf */
> + short *queue; /* queue storage for breadth search */
> +};

I wonder if it'd not be better to put this in a separate file. Most
readers just won't care about this bit, and the file is long enough.

> - if (!parse->hasAggs && !parse->groupClause && !root->hasHavingQual &&
> + if (!parse->hasAggs && !parse->groupClause && !parse->groupingSets && !root->hasHavingQual &&
> !parse->hasWindowFuncs)
> {

Looks like well above 80 columns.

> %nonassoc UNBOUNDED /* ideally should have same precedence as IDENT */
> -%nonassoc IDENT NULL_P PARTITION RANGE ROWS PRECEDING FOLLOWING
> +%nonassoc IDENT NULL_P PARTITION RANGE ROWS PRECEDING FOLLOWING CUBE ROLLUP
> %left Op OPERATOR /* multi-character ops and user-defined operators */

> +/*
> + * These hacks rely on setting precedence of CUBE and ROLLUP below that of '(',
> + * so that they shift in these rules rather than reducing the conflicting
> + * unreserved_keyword rule.
> + */
> +
> +rollup_clause:
> + ROLLUP '(' expr_list ')'
> + {
> + $$ = (Node *) makeGroupingSet(GROUPING_SET_ROLLUP, $3, @1);
> + }
> + ;
> +
> +cube_clause:
> + CUBE '(' expr_list ')'
> + {
> + $$ = (Node *) makeGroupingSet(GROUPING_SET_CUBE, $3, @1);
> + }
> + ;
> +
> +grouping_sets_clause:
> + GROUPING SETS '(' group_by_list ')'
> + {
> + $$ = (Node *) makeGroupingSet(GROUPING_SET_SETS, $4, @1);
> + }
> + ;
> +

This is somewhat icky. I've not really thought about this very much,
but it's imo something to pay attention to.

So, having quickly scanned through the patch, do I understand correctly
that the contentious problems are:

* Arguably this converts the execution *tree* into a DAG. Tom seems to
  be rather uncomfortable with that. I am wondering whether this really
  is a big deal - essentially this only happens in a relatively
  'isolated' part of the tree, right? I.e. if those chained-together
  nodes were considered one node, there would not be any loops?
  Additionally, the way parametrized scans work already essentially
  "violates" the tree paradigm somewhat.
  There might still be better representations, about which I want to
  think - don't get me wrong. I'm "just" not seeing this as a
  fundamental problem.
* The whole grammar/keyword issue. To me this seems to be a problem of
  swallowing one out of several similarly coloured poisonous pills,
  which we can't really avoid - i.e. it isn't really this patch's fault
  that we have to make these choices.

Are those the two bigger controversial areas? Or are there others in
your respective views?

Greetings,

Andres Freund
