Re: Parallel grouping sets

From: Pengzhou Tang <ptang(at)pivotal(dot)io>
To: Richard Guo <riguo(at)pivotal(dot)io>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel grouping sets
Date: 2019-09-30 09:41:23
Message-ID: CAG4reARMcyn+X8gGRQEZyt32NoHc9MfznyPsg_C_V9G+dnQ15Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Richard & Tomas:

I followed the idea of the second approach to add a gset_id in the
targetlist of the first stage of
grouping sets and uses it to combine the aggregate in final stage. gset_id
stuff is still kept
because of GROUPING() cannot uniquely identify a grouping set, grouping
sets may contain
duplicated set, eg: group by grouping sets((c1, c2), (c1,c2)).

There are some differences to implement the second approach comparing to
the original idea from
Richard, gset_id is not used as additional group key in the final stage,
instead, we use it to
dispatch the input tuple to the specified grouping set directly and then do
the aggregate.
One advantage of this is that we can handle multiple rollups with better
performance without APPEND node.

the plan now looks like:

gpadmin=# explain select c1, c2 from gstest group by grouping
sets(rollup(c1, c2), rollup(c3));
QUERY PLAN
--------------------------------------------------------------------------------------------
Finalize MixedAggregate (cost=1000.00..73108.57 rows=8842 width=12)
Dispatched by: (GROUPINGSETID())
Hash Key: c1, c2
Hash Key: c1
Hash Key: c3
Group Key: ()
Group Key: ()
-> Gather (cost=1000.00..71551.48 rows=17684 width=16)
Workers Planned: 2
-> Partial MixedAggregate (cost=0.00..68783.08 rows=8842
width=16)
Hash Key: c1, c2
Hash Key: c1
Hash Key: c3
Group Key: ()
Group Key: ()
-> Parallel Seq Scan on gstest (cost=0.00..47861.33
rows=2083333 width=12)
(16 rows)

gpadmin=# set enable_hashagg to off;
gpadmin=# explain select c1, c2 from gstest group by grouping
sets(rollup(c1, c2), rollup(c3));
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=657730.66..663207.45 rows=8842 width=12)
Dispatched by: (GROUPINGSETID())
Group Key: c1, c2
Sort Key: c1
Group Key: c1
Group Key: ()
Group Key: ()
Sort Key: c3
Group Key: c3
-> Sort (cost=657730.66..657774.87 rows=17684 width=16)
Sort Key: c1, c2
-> Gather (cost=338722.94..656483.04 rows=17684 width=16)
Workers Planned: 2
-> Partial GroupAggregate (cost=337722.94..653714.64
rows=8842 width=16)
Group Key: c1, c2
Group Key: c1
Group Key: ()
Group Key: ()
Sort Key: c3
Group Key: c3
-> Sort (cost=337722.94..342931.28 rows=2083333
width=12)
Sort Key: c1, c2
-> Parallel Seq Scan on gstest
(cost=0.00..47861.33 rows=2083333 width=12)

References:
[1] https://github.com/greenplum-db/postgres/tree/parallel_groupingsets
<https://github.com/greenplum-db/postgres/tree/parallel_groupingsets_3>_3

On Wed, Jul 31, 2019 at 4:07 PM Richard Guo <riguo(at)pivotal(dot)io> wrote:

> On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra <
> tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>> On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
>> >On Wed, Jun 12, 2019 at 10:58 AM Richard Guo <riguo(at)pivotal(dot)io> wrote:
>> >
>> >> Hi all,
>> >>
>> >> Paul and I have been hacking recently to implement parallel grouping
>> >> sets, and here we have two implementations.
>> >>
>> >> Implementation 1
>> >> ================
>> >>
>> >> Attached is the patch and also there is a github branch [1] for this
>> >> work.
>> >>
>> >
>> >Rebased with the latest master.
>> >
>>
>> Hi Richard,
>>
>> thanks for the rebased patch. I think the patch is mostly fine (at least I
>> don't see any serious issues). A couple minor comments:
>>
>
> Hi Tomas,
>
> Thank you for reviewing this patch.
>
>
>>
>> 1) I think get_number_of_groups() would deserve a short explanation why
>> it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
>> same branch. Before these cases were clearly separated, now it seems a bit
>> mixed up and it may not be immediately obvious why it's OK.
>>
>
> Added a short comment in get_number_of_groups() explaining the behavior
> when doing partial aggregation for grouping sets.
>
>
>>
>> 2) There are new regression tests, but they are not added to any schedule
>> (parallel or serial), and so are not executed as part of "make check". I
>> suppose this is a mistake.
>>
>
> Yes, thanks. Added the new regression test in parallel_schedule and
> serial_schedule.
>
>
>>
>> 3) The regression tests do check plan and results like this:
>>
>> EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
>> SELECT ... ORDER BY 1, 2, 3;
>>
>> which however means that the query might easily use a different plan than
>> what's verified in the eplain (thanks to the additional ORDER BY clause).
>> So I think this should explain and execute the same query.
>>
>> (In this case the plans seems to be the same, but that may easily change
>> in the future, and we could miss it here, failing to verify the results.)
>>
>
> Thank you for pointing this out. Fixed it in V4 patch.
>
>
>>
>> 4) It might be a good idea to check the negative case too, i.e. a query on
>> data set that we should not parallelize (because the number of partial
>> groups would be too high).
>>
>
> Yes, agree. Added a negative case.
>
>
>>
>>
>> Do you have any plans to hack on the second approach too? AFAICS those two
>> approaches are complementary (address different data sets / queries), and
>> it would be nice to have both. One of the things I've been wondering is if
>> we need to invent gset_id as a new concept, or if we could simply use the
>> existing GROUPING() function - that uniquely identifies the grouping set.
>>
>>
> Yes, I'm planning to hack on the second approach in short future. I'm
> also reconsidering the gset_id stuff since it brings a lot of complexity
> for the second approach. I agree with you that we can try GROUPING()
> function to see if it can replace gset_id.
>
> Thanks
> Richard
>

Attachment Content-Type Size
0001-Support-for-parallel-grouping-sets.patch application/octet-stream 63.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message M Beena Emerson 2019-09-30 09:58:53 Drop Trigger Mechanism with Detached partitions
Previous Message Jeevan Chalke 2019-09-30 09:31:25 Re: backup manifests