Re: [HACKERS] Partition-wise aggregation/grouping

From: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Partition-wise aggregation/grouping
Date: 2017-11-23 13:08:16
Message-ID: CAM2+6=XVzdRT5G94aOAD0JsYOLaEa1-f--ck8_5n3aCmdDd=tQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 17, 2017 at 5:54 PM, Ashutosh Bapat <
ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:

> On Wed, Nov 15, 2017 at 5:31 PM, Jeevan Chalke
> <jeevan(dot)chalke(at)enterprisedb(dot)com> wrote:
> >
> > OK. Done in the attached patch set.
> >
> > I have rebased all my patches on latest HEAD which is at
> > 7518049980be1d90264addab003476ae105f70d4
> >
> > Thanks
>
> These are review comments for the last set and I think most of them
> apply to the new set as well.
>
> Patches 0001 - 0005 refactoring existing code. I haven't
> reviewed them in detail, checking whether we have missed anything in
> moving the
> code, but they mostly look fine.
>

Thanks.

>
> Comments on 0006
> /*
> + * cost_append
> + * Determines and returns the cost of an Append node.
> + *
> ... clipped portion
> +
> + /* Add Append node overhead. */
> + run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
> +
>
> I am wondering whether it's really worth creating a new function for a
> single
> line addition to create_append_path(). I think all we need to change in
> create_append_path() is add (cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR *
> tuples) to path->total_cost.
>

Agree. However, there was ab existing comment in create_append_path() saying
"We don't bother with inventing a cost_append(), but just do it here", which
implies at sometime in future we may need it; so why not now where we are
explicitly costing for an append node. Having a function is good so that,
if required in future, we need update in only this function.
Let me know if you think otherwise, I make those changes in next patchset.

>
> + /* Add MergeAppend node overhead like we do it for the Append node */
> + run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
> +
>
> With this change the following comment is no more true. Please remove it.
> * extracted tuple. We don't charge cpu_tuple_cost because a
> MergeAppend
> * node doesn't do qual-checking or projection, so it has less overhead
> * than most plan nodes.
> */
>
>
This was already fixed in v7.

+/*
> + * Arbitrarily use 50% of the cpu_tuple_cost to cost append node. Note
> that
>
> May be reword it as " ... to cost per tuple processing by an append node
> ..."
>

Done.

>
> + * this value should be multiplied with cpu_tuple_cost wherever
> applicable.
> + */
> +#define DEFAULT_APPEND_COST_FACTOR 0.5
>
> I am wondering whether we should just define
> #define APPEND_TUPLE_COST (cpu_tuple_cost * 0.5)
> and use this macro everywhere. What else use DEFAULT_APPEND_COST_FACTOR
> would
> have other than multiplying with cpu_tuple_cost?
>

As suggested by Robert, I have renamed it to APPEND_CPU_COST_MULTIPLIER in
v7 patchset.
Also, retained the #define for just multiplier as suggested by Robert.

> -- test partition matching with N-way join
> EXPLAIN (COSTS OFF)
> SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
> plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') =
> t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
> - QUERY PLAN
> ------------------------------------------------------------
> ---------------------------
> + QUERY PLAN
> +-----------------------------------------------------------
> ---------------------
> Sort
> Sort Key: t1.c, t3.c
> -> HashAggregate
> Group Key: t1.c, t2.c, t3.c
> - -> Result
> + -> Hash Join
> + Hash Cond: (t1.c = t2.c)
> -> Append
> - -> Hash Join
> - Hash Cond: (t1.c = t2.c)
>
> That's sad. Interestingly this query has an aggregate, so the plan will use
> partition-wise join again when partition-wise aggregation patch will be
> applied. So may be fine.
>

Yep. I have modified this testcase and enabled partition-wise aggregation
before this test, so that we will see the desired plan.

> - Append (cost=0.00..0.04 rows=2 width=32)
> + Append (cost=0.00..0.05 rows=2 width=32)
>
> - Append (cost=0.00..0.04 rows=2 width=4)
> + Append (cost=0.00..0.05 rows=2 width=4)
>
> We do have some testcases which print costs. Interesting :). I don't have
> any objection to this change.
>

OK. Thanks.

> Comments on 0007
>
> + <para>
> + Enables or disables the query planner's use of partition-wise
> grouping
> + or aggregation, which allows If partition-wise aggregation
> does not result in the
> + cheapest path, it will still spend time in creating these paths
> and
> + consume memory which increase linearly with the number of
> partitions.
> + The default is <literal>off</>.
> + </para>
> + </listitem>
> + </varlistentry>
> +
> May be we should word this in the same manner as partition-wise join like
>
> Enables or disables the query planner's use of partition-wise
> grouping
> or aggregation, which allows aggregation or grouping on a
> partitioned
> tables to be spread across the partitions. If <literal>GROUP
> BY<literal> clause includes partition keys, the rows are
> aggregated at
> each partition. Otherwise, partial aggregates computed for each
> partition are required to be combined. Because partition-wise
> aggregation/gropuing can use significantly more CPU time and memory
> during planning, the default is <literal>off</literal>.
>

Done. Thanks for the new wordings.

> +
> +Partition-wise aggregates/grouping
> +----------------------------------
>
> ... clipped patch
>
> +In above plan, aggregation is performed after append node which means
> that the
> +whole table is an input for the aggregation node. However, with
> partition-wise
> +aggregation, same query will have plane like:
>
> s/plane/plan/
>

Oops. Fixed.

>
> + Append
>
> ... clipped patch
>
> +PartialAggregate stage greatly reduces the number of groups and lose if
> we have
> +lots of small groups.
>
> To keep the discussion brief, I suggest we rewrite this paragraph as
>
> ----
> If GROUP BY clause has all partition keys, all the rows that belong to a
> given
> group come from a single partition and thus aggregates can be finalized
> separately for each partition. When the number of groups is far lesser
> than the
> number of rows being grouped, as usually is the case, the number of rows
> processed by an Append node reduces apart from reducing the size of the
> hash
> table or size of the data to be sorted. This usually improves efficiency
> of the
> query. If GROUP BY doesn't contain all the partition keys, partial
> aggregates can be computed for
> each partition followed by combining partial aggregates from one or more
> partitions belonging to the same group to compute complete aggregate for
> each
> group. This improves efficiency of the query if the number of groups is far
> less than the number of rows produced by the scan underneath.
> ---
>
> I am not sure whether we should be discussing why this technique performs
> better or when it performs better. We don't have similar discussion for
> partition-wise join. That paragraph just describes the technique and may
> be we
> want to do the same here.
>

OK.
I have removed the text explaining when it performs better.
Please have a look over new text and let me know your views.

>
> + *
> + * extra is the additional information required when we are doing
> aggregation
> + * or grouping below the append node. In case of partial partition-wise
> + * aggregation on a child node, we need to compute finalized step after
> the
> + * append, which cannot be done in this function. And thus if we have
> non-NULL
> + * value for extra, we call create_partition_agg_paths() to create an
> append
> + * node and finalization, if any.
>
> May be we want to just say "extra provides more information about the
> partitioned aggregation/grouping e.g path target, whether to use partial
> aggregate and so on." When present we call create_partition_agg_paths() to
> add
> paths for partition-wise aggregatges.
>

Done.

>
> - add_path(rel, (Path *) create_append_path(rel, subpaths,
> - rel->reltarget, NULL, 0,
> - partitioned_rels));
> + {
> + if (extra)
> + create_partition_agg_paths(root, rel, subpaths, NIL,
> + NIL, NULL, 0,
> + partitioned_rels, extra);
> + else
> + add_path(rel, (Path *) create_append_path(rel, subpaths,
> + rel->reltarget,
> NULL, 0,
> + partitioned_rels));
> + }
>
> I am wondering whether we could write a function to call appropriate one
> out of
> create_append_path(), create_partition_agg_paths() or
> create_merge_append_path() based on the presence of extra and/or pathkeys
> and
> use it everywhere such a change is made. I don't know whether that will be
> worth the code. But there are a handful places where such diffs are
> required.
>

Done.
Added function named add_append_path() which does the same. Function name
seems too generic, it will be good if you suggest few.

> -
> - plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
> NULL);
> + plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
> + IS_OTHER_REL(best_path->subpath->parent)
> ?
> + best_path->path.parent->relids : NULL);
>
> While I can guess why this change is required, it may be better to
> separate it
> into a patch of its own and adding some explanation in the commit message,
> for
> other reviewers.
>

Done.
Tried adding comments. See whether it make sense or need further
improvements.

> + /* Copy input rels's relids to grouped rel */
> + grouped_rel->relids = input_rel->relids;
>
> I am fine with this change, but Tom may not agree [1]. May be we should
> get his
> opinion on this one.
>

Yep. Agree.
This mainly required for FDW as all_baserels do not have relids for the
child relations. Another solution I think of is to have say, rel->fs_relids
and set that appropriately and use it in create_foreignscan_plan().

> /*
> + * If input relation is partitioned, check if we can perform
> + * partition-wise grouping and/or aggregation.
> + */
>
> Just like partition-wise join a concise "Apply partition-wise aggregation
> technique, if possible." would suffice.
>

Done.

>
> dNumPartialGroups = get_number_of_groups(root,
> cheapest_partial_path->rows,
> gd,
> - parse->targetList);
> +
> make_tlist_from_pathtarget(target));
> Can we guarantee that the output of make_tlist_from_pathtarget() will be
> same
> as translation of parse->targetList for the given child? Even if not, may
> be
> it's fine to pass slightly different tlist to get_number_of_groups() since
> it
> doesn't depend upon the exact shape but right group column references.
> Nonetheless something to test and verify.
>

We are interested here to get the group expressions. get_number_of_groups()
fetches the group expressions from tlist by checking tle->ressortgroupref
which is presumably same as that of path target's sortgrouprefs. So I don't
see any issue here.

> *
> - * Determines whether parallel grouping and/or aggrgation is possible, or
> not.
> + * Determines whether parallel grouping and/or aggregation is possible,
> or not.
> * Returns true when possible, false otherwise.
>
> Does this hunk belong to one of the refactoring patches or as a separate
> patch
> correcting a typo?
>

Oops. Moved to the appropriate refactoring patch.

> +/*
> + * try_partition_wise_grouping
> + *
> + * If the input relation is partitioned and the partition keys are part
> of the
> + * group by clauses, each partition produces a different set of groups.
> + * Aggregates within each such group can be computed partition-wise. This
>
> While these sentences are correct, I think the reason why we could compute
> an
> aggregate at the level of each partition is because rows from a given group
> belong to a single partition. So, I guess, we have to reword this as
>
> "If the partition keys of input relation are part of group by clause, all
> the
> rows belonging to a given group come from a single partition, each
> partition
> producing a different set of groups. This allows aggregation/grouping over
> a
> partitioned relation to be broken down into aggregation/grouping on each
> partition.
>
> If group by clause does not contain all the partition keys, rows from a
> given
> group may be spread across multiple partitions. In that case, we can
> combine
> partial aggregates for a given group across partitions to produce the final
> aggregate for a that group "
>

Done. Thanks.

> + * might be optimal because of presence of suitable paths with pathkeys or
> + * because the hash tables for most of the partitions fit into the memory.
> + * However, if partition keys are not part of the group by clauses, then
> we
> + * still able to compute the partial aggregation for each partition and
> then
> + * finalize them over append. This can either win or lose. It may win if
> the
> + * PartialAggregate stage greatly reduces the number of groups and lose
> if we
> + * have lots of small groups.
>
> I have not seen prologue of a function implementing a query optimization
> technique explain why that technique improves performance. So I am not sure
> whether the comment should include this explanation. One of the reasons
> being
> that the reasons why a technique works might change over the period of time
> with the introduction of other techniques, thus obsoleting the comment.
> But
> may be it's good to have it here.
>

Yep. Retained.

> + /*
> + * Grouping sets plan does not work with an inheritance subtree (see
> notes
> + * in create_groupingsets_plan). Thus do not handle grouping sets
> here.
> + */
> + if (query->groupingSets || gd)
> + return;
>
> Even if that restriction is lifted, we won't be able to compute
> "whole" grouping sets
> for each partition, since grouping sets implies multiple group by clauses,
> each
> of which may not have all partition keys. Those sets which have all
> partition
> keys will be computed completely for each partition, but others will
> require
> partial aggregation. I guess, we will need to apply partition-wise
> aggregation
> at each derived group by clause and not as a whole-sale strategy.
>

Done.

>
> Anyway, it doesn't look like a good idea to pass an argument (gd) only to
> return from that function in case of its presence. May be we should handle
> it
> outside this function.
>

Well, I would like to have it inside the function itself. Let the function
itself do all the necessary checking rather than doing some of them outside.

> +
> + /* Nothing to do, if the input relation is not partitioned. */
> + if (!input_rel->part_scheme)
> + return;
> +
> + Assert(input_rel->part_rels);
>
> For a join between two partitioned tables with one of them being dummy
> relation, would have part_scheme set but not part_rels (See
> try_partition_wise_join()). This assertion would
> fail in such a case. Have you tested the case? May be we should just test
> if
> input_rel->part_rels exists similar to generate_partition_wise_join_
> paths().
>

Yep. This was already fixed in v7 and also has a covering testcase.

Also, how is a dummy input relation is handled in this function? Do we need
> to
> handle?
>

Yes, we need to handle. Need to return without doing PWA when input
relation itself is dummy.
Added covering testcase for it.

+ nparts = input_rel->nparts;
> + part_rels = (RelOptInfo **) palloc(nparts * sizeof(RelOptInfo *));
> + grouped_rel->part_rels = part_rels;
>
> For a partial aggregation, we can't say that the child rels produced here
> are
> partitions of the top grouped relation, so setting part_rels looks wrong.
> We
> should set this only when a full aggregate is obtained from each partition.
>

Done.

>
> + scanjoin_target =
> copy_pathtarget(input_rel->cheapest_startup_path->pathtarget);
> + scanjoin_target->exprs = (List *) adjust_appendrel_attrs(root,
> +
> (Node *) scanjoin_target->exprs,
> +
> nappinfos,
> +
> appinfos);
>
> Why can't we use input_child_rel->pathtarget? It should be same as the
> translation of its parent's path target. I probably understand that's
> because
> the input rel's path targets have been changed after the underlying join
> was
> planned, a step which is not applied to the individual children. May be
> add a
> comment here?
>

Done. Added comments.

> + child_target->exprs = (List *) adjust_appendrel_attrs(root,
> + (Node
> *) target->exprs,
> + nappinfos,
> + appinfos);
> + partial_target = make_partial_grouping_target(root, target);
> + partial_target->exprs = (List *) adjust_appendrel_attrs(root,
> + (Node
> *) partial_target->exprs,
> + nappinfos,
> + appinfos);
>
> We need both of these steps for any aggregate since parallel paths will
> compute
> parial paths anyway. If that's correct, may be we should add a comment?
>

Done.

>
> + extra.inputRows = 0; /* Not needed at child paths creation */
>
> Why? Comment should be on its own line.
>

It is actually not used in create_child_grouping_paths(). But setting that
value has no side effect, thus set that correctly and removed the comments.

> + if (!create_child_grouping_paths(root, input_child_rel,
> agg_costs, gd,
> + &extra))
> + {
> + /* Could not create path for childrel, return */
> + pfree(appinfos);
> + return;
> + }
>
> Can we detect this condition and bail out even before planning any of the
> children? It looks wasteful to try to plan children only to bail out in
> this
> case.
>

I don't think so. It is like non-reachable and added just for a safety in
case we can't able to create a child path. The bail out conditions cannot
be evaluated at the beginning. Do you this an Assert() will be good here?
Am I missing something?

> + /* Nothing to do if we have no live children */
> + if (live_children == NIL)
> + return;
>
> A parent relation with all dummy children will also be dummy. May be we
> should
> mark the parent dummy case using mark_dummy_rel() similar to
> generate_partition_wise_join_paths().
>

If parent is dummy, then we are not at all doing PWA. So no need to mark
parent grouped_rel as dummy I guess.
However, if some of the children are dummy, I am marking corresponding
upper rel as dummy too.
Actually, this condition will never going to be true as you said correctly
that "A parent relation with all dummy children will also be dummy". Should
we have an Assert() instead?

> +/*
> + * have_grouping_by_partkey
> + *
>
> Somehow this name sounds like it would return true when GROUP BY contains
> only
> partition key. May be rename as group_by_has_partkey? to indicate the
>

OK. Renamed.

> + * Returns true, if partition keys of the given relation are part of the
> + * GROUP BY clauses, false otherwise.
>
> Reword as " ... if all the partition keys of ... "
>

Done.

>
> +static bool
> +have_grouping_by_partkey(RelOptInfo *input_rel, PathTarget *target,
> + List *groupClause)
> +{
> + List *tlist = make_tlist_from_pathtarget(target);
> + List *groupexprs = get_sortgrouplist_exprs(groupClause, tlist);
>
> Have we tested the case with multi-level partitioned table and children
> with
> different order of partition key columns?
>

I have testcase for multi-level partitioned table.
However, I did not understand by what you mean by "children with different
order of partition key columns". I had a look over tests in
partition_join.sql and it seems that I have cover all those scenarios.
Please have a look over testcases added for PWA and let me know the
scenarios missing, I will add them then.

> + partexprs = input_rel->partexprs ? input_rel->partexprs[cnt] :
> NIL;
> +
> + /* Rule out early, if there are no partition keys present */
> + if (partexprs == NIL)
> + return false;
>
> If input_rel->partexprs is NIL, we should "bail" out even before the loop
> starts.
>

Done.

>
> + foreach(lc, partexprs)
> + {
> + Expr *partexpr = lfirst(lc);
> +
> + if (list_member(groupexprs, partexpr))
> + {
> + found = true;
> + break;
> + }
> + }
>
> This looks like a useful piece of general functionality
> list_has_intersection(), which would returns boolean instead of the whole
> intersection. I am not sure whether we should add that function to list.c
> and
> use here.
>

Sounds good.
But for now, I am keeping it as part of this feature itself.

>
> + * If none of the partition key matches with any of the GROUP BY
>
> Reword as "... the partition key expressions match with ...."
>

Done.

>
> This isn't a full review of 0007, but I think it covers most of the new
> functionality.
>
> [1] https://www.postgresql.org/message-id/CAFjFpRdUz6h6cmFZFYAngmQAX8Zvo
> +MZsPXidZ077h=gp9bvQw(at)mail(dot)gmail(dot)com
>
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>

Thanks for the details review Ashutosh.

Let me know if I missed any comment to be fixed.

Thanks

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment Content-Type Size
partition-wise-agg-v8.tar.gz application/x-gzip 35.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-11-23 13:08:29 Re: PostgreSLQ v10.1 and xlC compiler on AIX
Previous Message Amit Khandekar 2017-11-23 12:57:16 Re: [HACKERS] UPDATE of partition key