Re: [HACKERS] Partition-wise aggregation/grouping

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: [HACKERS] Partition-wise aggregation/grouping
Date: 2017-12-12 10:13:11
Message-ID: CAFjFpReOmtwo32YuH7hkYy3e1wTKQwvEd8q1ryVjWPknkY+_Yw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here are review comments for 0009
Only full aggregation is pushed on the remote server.

I think we can live with that for a while but we need to be able to push down
partial aggregates to the foreign server. I agree that it needs some
infrastructure to serialized and deserialize the partial aggregate values,
support unpartitioned aggregation first and then work on partitioned
aggregates. That is definitely a separate piece of work.

+-- ===================================================================
+-- test partition-wise-aggregates
+-- ===================================================================
+CREATE TABLE pagg_tab (a int, b int, c text) PARTITION BY RANGE(a);
+CREATE TABLE pagg_tab_p1 (a int, b int, c text);
+CREATE TABLE pagg_tab_p2 (a int, b int, c text);
+CREATE TABLE pagg_tab_p3 (a int, b int, c text);

Like partition-wise join testcases please use LIKE so that it's easy to change
the table schema if required.

+INSERT INTO pagg_tab_p1 SELECT i % 30, i % 50, to_char(i/30,
'FM0000') FROM generate_series(1, 3000) i WHERE (i % 30) < 10;
+INSERT INTO pagg_tab_p2 SELECT i % 30, i % 50, to_char(i/30,
'FM0000') FROM generate_series(1, 3000) i WHERE (i % 30) < 20 and (i %
30) >= 10;
+INSERT INTO pagg_tab_p3 SELECT i % 30, i % 50, to_char(i/30,
'FM0000') FROM generate_series(1, 3000) i WHERE (i % 30) < 30 and (i %
30) >= 20;

We have to do this because INSERT tuple routing to a foreign partition is not
supported right now. Somebody has to remember to change this to a single
statement once that's done.

+ANALYZE fpagg_tab_p1;
+ANALYZE fpagg_tab_p2;
+ANALYZE fpagg_tab_p3;

I thought this is not needed. When you ANALYZE the partitioned table, it would
analyze the partitions as well. But I see that partition-wise join is also
ANALYZING the foreign partitions separately. When I ran ANALYZE on a
partitioned table with foreign partitions, statistics for only the local tables
(partitioned and partitions) was updated. Of course this is separate work, but
probably needs to be fixed.

+-- When GROUP BY clause matches with PARTITION KEY.
+-- Plan when partition-wise-agg is disabled

s/when/with/

+-- Plan when partition-wise-agg is enabled

s/when/with/

+ -> Append

Just like ForeignScan node's Relations tell what kind of ForeignScan this is,
may be we should annotate Append to tell whether the children are joins,
aggregates or relation scans. That might be helpful. Of course as another
patch.

+SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING
avg(b) < 25 ORDER BY 1;
+ a | sum | min | count
+----+------+-----+-------
+ 0 | 2000 | 0 | 100
+ 1 | 2100 | 1 | 100
[ ... clipped ...]
+ 23 | 2300 | 3 | 100
+ 24 | 2400 | 4 | 100
+(15 rows)

May be we want to reduce the number of rows to a few by using a stricter HAVING
clause?

+
+-- When GROUP BY clause not matches with PARTITION KEY.

... clause does not match ...

+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING
sum(a) < 800 ORDER BY 1;
+
QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------------------------------------------
+ Sort
+ Output: fpagg_tab_p1.b, (avg(fpagg_tab_p1.a)),
(max(fpagg_tab_p1.a)), (count(*))
+ -> Partial HashAggregate
[ ... clipped ... ]
+ Output: fpagg_tab_p3.b, PARTIAL
avg(fpagg_tab_p3.a), PARTIAL max(fpagg_tab_p3.a), PARTIAL count(*),
PARTIAL sum(fpagg_tab_p3.a)
+ Group Key: fpagg_tab_p3.b
+ -> Foreign Scan on public.fpagg_tab_p3
+ Output: fpagg_tab_p3.b, fpagg_tab_p3.a
+ Remote SQL: SELECT a, b FROM public.pagg_tab_p3
+(26 rows)

I think we interested in overall shape of the plan and not the details of
Remote SQL etc. So, may be turn off VERBOSE. This comment applies to an earlier
plan with enable_partition_wise_agg = false;

+
+SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING
sum(a) < 800 ORDER BY 1;
+ b | avg | max | count
+----+---------------------+-----+-------
+ 0 | 10.0000000000000000 | 20 | 60
+ 1 | 11.0000000000000000 | 21 | 60
[... clipped ...]
+ 42 | 12.0000000000000000 | 22 | 60
+ 43 | 13.0000000000000000 | 23 | 60
+(20 rows)

Since the aggregates were not pushed down, I doubt we should be testing the
output. But this test is good to check partial aggregates over foreign
partition scans, which we don't have in postgres_fdw.sql I think. So, may be
add it as a separate patch?

Can you please add a test where we reference a whole-row; that usually has
troubles.

- if (root->hasHavingQual && query->havingQual)
+ if (root->hasHavingQual && fpinfo->havingQual)

This is not exactly a problem with your patch, but why do we need to check both
the boolean and the actual clauses? If the boolean is true, query->havingQual
should be non-NULL and NULL otherwise.

/* Grouping information */
List *grouped_tlist;
+ PathTarget *grouped_target;
+ Node *havingQual;

I think we don't need havingQual as a separate member. foreign_grouping_ok()
separates the clauses in havingQual into shippable and non-shippable clauses
and saves in local_conditions and remote_conditions. Probably we want to use
those instead of adding a new member.

index 04e43cc..c8999f6 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -62,7 +62,8 @@ typedef void (*GetForeignJoinPaths_function)
(PlannerInfo *root,
typedef void (*GetForeignUpperPaths_function) (PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
- RelOptInfo *output_rel);
+ RelOptInfo *output_rel,
+ UpperPathExtraData *extra);

Probably this comment belongs to 0007, but it's in this patch that it becomes
clear how invasive UpperPathExtraData changes are. While UpperPathExtraData has
upper paths in the name, all of its members are related to grouping. That's
fine since we only support partition-wise aggregate and not the other upper
operations. But if we were to do that in future, which of these members would
be applicable to other upper relations? inputRows, pathTarget,
partialPathTarget may be applicable to other upper rels as well. can_sort,
can_hash may be applicable to DISTINCT, SORT relations. isPartial and
havingQual will be applicable only to Grouping/Aggregation. So, may be it's ok,
and like RelOptInfo we may separate them by comments.

Another problem with that structure is its name doesn't mention that the
structure is used only for child upper relations, whereas the code assumes that
if extra is not present it's a parent upper relation. May be we want to rename
it to that effect or always use it whether for a parent or a child relation.

We may want to rename pathTarget and partialPathTarget as relTarget and
partialRelTarget since those targets are not specific to any path, but will be
applicable to all the paths created for that rel.

On Mon, Dec 4, 2017 at 7:44 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> On Sat, Dec 2, 2017 at 4:08 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Fri, Dec 1, 2017 at 7:41 AM, Ashutosh Bapat
>> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>> This code creates plans where there are multiple Gather nodes under an Append
>>> node.
>>
>> We should avoid that. Starting and stopping workers is inefficient,
>> and precludes things like turning the Append into a Parallel Append.
>
> Ah, I didn't think about it. Thanks for bringing it up.
>
>>
>>> AFAIU, the workers assigned to one gather node can be reused until that
>>> Gather node finishes. Having multiple Gather nodes under an Append mean that
>>> every worker will be idle from the time that worker finishes the work till the
>>> last worker finishes the work.
>>
>> No, workers will exit as soon as they finish. They don't hang around idle.
>
> Sorry, I think I used wrong word "idle". I meant that if a worker
> finishes and exists, the query can't use it that worker slot until the
> next Gather node starts. But as you pointed out, starting and stopping
> a worker is costlier than the cost of not using the slot. So we should
> avoid such plans.
>
>>
>>> index b422050..1941468 100644
>>> --- a/src/tools/pgindent/typedefs.list
>>> +++ b/src/tools/pgindent/typedefs.list
>>> @@ -2345,6 +2345,7 @@ UnlistenStmt
>>> UnresolvedTup
>>> UnresolvedTupData
>>> UpdateStmt
>>> +UpperPathExtraData
>>> UpperRelationKind
>>> UpperUniquePath
>>> UserAuth
>>>
>>> Do we commit this file as part of the feature?
>>
>> Andres and I regularly commit such changes; Tom rejects them.
>>
>
> We will leave it to the committer to decide what to do with this hunk.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kuntal Ghosh 2017-12-12 10:30:39 Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD
Previous Message Rushabh Lathia 2017-12-12 10:09:29 Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)