Re: Parallel Aggregate

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Paul Ramsey <pramsey(at)cleverelephant(dot)ca>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Aggregate
Date: 2016-03-18 13:16:51
Message-ID: CAKJS1f80=f-z1CUU7=QDmn0r=_yeU7paN2dZ6rQSnUpfEFOUNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 18 March 2016 at 20:25, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> Few more comments:
>
> 1.
>
> + if (parse->groupClause)
> + path = (Path *) create_sort_path(root,
> + grouped_rel,
> + path,
> + root->group_pathkeys,
> + -1.0);
>
> For final path, why do you want to sort just for group by case?

If there's no GROUP BY then there will only be a single group, this
does not require sorting, e.g SELECT SUM(col) from sometable;

I added the comment:

/*
* Gather is always unsorted, so we'll need to sort, unless there's
* no GROUP BY clause, in which case there will only be a single
* group.
*/

> 2.
> + path = (Path *) create_gather_path(root, partial_grouped_rel, path,
> + NULL, &total_groups);
> +
> + if (parse->groupClause)
> + path = (Path *) create_sort_path(root,
> + grouped_rel,
> + path,
> + root->group_pathkeys,
> + -1.0);
> +
> + if (parse->hasAggs)
> + add_path(grouped_rel, (Path *)
> + create_agg_path(root,
> + grouped_rel,
> + path,
> + target,
> + parse->groupClause ? AGG_SORTED : AGG_PLAIN,
> + parse->groupClause,
> + (List *) parse->havingQual,
> + &agg_costs,
> + partial_grouped_rel->rows,
> + true,
> + true));
> + else
> + add_path(grouped_rel, (Path *)
> + create_group_path(root,
> + grouped_rel,
> + path,
> + target,
> + parse->groupClause,
> + (List *) parse->havingQual,
> + total_groups));
>
> In above part of patch, it seems you are using number of groups
> differenetly; for create_group_path() and create_gather_path(), you have
> used total_groups whereas for create_agg_path() partial_grouped_rel->rows is
> used, is there a reason for the same?

That's a mistake... too much code shuffling yesterday it seems.

> 3.
> + if (grouped_rel->partial_pathlist)
> + {
> + Path *path = (Path *)
> linitial(grouped_rel->partial_pathlist);
> + double total_groups;
> +
> + total_groups =
> path->rows * path->parallel_degree;
> + path = (Path *) create_gather_path(root, partial_grouped_rel,
> path,
> + NULL, &total_groups);
>
> A. Won't passing partial_grouped_rel lead to incomplete information required
> by create_gather_path() w.r.t the case of parameterized path info?

There should be no parameterized path info after joins are over, but
never-the-less I took your advice about passing PathTarget to
create_gather_path(), so this partial_grouped_rel no longer exists.

> B. You have mentioned that passing grouped_rel will make gather path contain
> the information of final path target, but what is the problem with that? I
> mean to ask why Gather node is required to contain partial path target
> information instead of final path target.

Imagine a query such as: SELECT col,SUM(this) FROM sometable GROUP BY
col HAVING SUM(somecolumn) > 0;

In this case SUM(somecolumn) won't be in the final PathTarget. The
partial grouping target will contain the Aggref from the HAVING
clause. The other difference with te partial aggregate PathTarget is
that the Aggrefs return the partial state in exprType() rather than
the final value's type, which is required so the executor knows how to
form and deform tuples, plus many other things.

> C. Can we consider passing pathtarget to create_gather_path() as that seems
> to save us from inventing new UpperRelationKind? If you are worried about
> adding the new parameter (pathtarget) to create_gather_path(), then I think
> we are already passing it in many other path generation functions, so why
> not for gather path generation as well?

That's a better idea... Changed to that...

> 4A.
> Overall function create_grouping_paths() looks better than previous, but I
> think still it is difficult to read. I think it can be improved by
> generating partial aggregate paths separately as we do for nestloop join
> refer function consider_parallel_nestloop

hmm, perhaps the partial path generation could be moved off to another
static function, although we'd need to pass quite a few parameters to
it, like can_sort, can_hash, partial_grouping_target, grouped_rel,
root. Perhaps it's worth doing, but we still need the
partial_grouping_target for the Gather node, so it's not like that
other function can do all of the parallel stuff... We'd still need
some knowledge of that in create_grouping_paths()

> 4B.
> Rather than directly using create_gather_path(), can't we use
> generate_gather_paths as for all places where we generate gather node,
> generate_gather_paths() is used.

I don't think this is a good fit here, although it would be nice as it
would save having to special case generating the final aggregate paths
on the top of the partial paths. It does not seem that nice as it's
not really that clear if we need to make a combine aggregate node, or
a normal aggregate node on the path. The only way to determine that
would by by checking if it was a GatherPath or not, and that does not
seem like a nice way to go about doing that. Someone might go and
invent something new like MergeGather one day.

> 5.
> +make_partialgroup_input_target(PlannerInfo *root, PathTarget *final_target)
> {
> ..
> ..
> + foreach(lc, final_target->exprs)
> + {
> + Expr *expr = (Expr *) lfirst(lc);
> +
> +
> i++;
> +
> + if (parse->groupClause)
> + {
> + Index sgref = final_target-
>>sortgrouprefs[i];
> +
> + if (sgref && get_sortgroupref_clause_noerr(sgref, parse->groupClause)
> +
> != NULL)
> + {
> + /*
> +
> * It's a grouping column, so add it to the input target as-is.
> + */
> +
> add_column_to_pathtarget(input_target, expr, sgref);
> + continue;
> +
> }
> + }
> +
> + /*
> + * Non-grouping column, so just remember the expression for later
> +
> * call to pull_var_clause.
> + */
> + non_group_cols = lappend(non_group_cols, expr);
> +
> }
> ..
> }
>
> Do we want to achieve something different in the above foreach loop than the
> similar loop in make_group_input_target(), if not then why are they not
> exactly same?

It seems that the problem that causes me to change that around is now
gone. With the change reverted I'm unable to produce the original
crash that I was getting. I know that Tom has done quite a number of
changes to PathTargets while I've been working on this, so perhaps not
surprising. I've reverted that change now.

> 6.
> + /* XXX this causes some redundant cost calculation ... */
> + input_target = set_pathtarget_cost_width(root,
> input_target);
> + return input_target;
>
> Can't we use return set_pathtarget_cost_width() directly rather than
> fetching it in input_target and then returning input_target?

Yes, fixed.

Many thanks for the thorough review.

I've attached an updated patch.

I also tweaked the partial path generation in create_grouping_paths()
so that it only considers sorting the cheapest path, or using any
existing pre-sorted paths, rather than trying to sort every path.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Allow-aggregation-to-happen-in-parallel_2016-03-19.patch application/octet-stream 52.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Christoph Berg 2016-03-18 13:26:50 Re: Relaxing SSL key permission checks
Previous Message Julien Rouhaud 2016-03-18 12:56:23 Re: Choosing parallel_degree