Re: Parallel grouping sets

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Richard Guo <riguo(at)pivotal(dot)io>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel grouping sets
Date: 2019-06-14 00:44:38
Message-ID: 20190614004438.e2y7ubdnpobq5t37@development
Lists: pgsql-hackers

On Fri, Jun 14, 2019 at 12:02:52PM +1200, David Rowley wrote:
>On Fri, 14 Jun 2019 at 11:45, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Wed, Jun 12, 2019 at 10:58:44AM +0800, Richard Guo wrote:
>
>> ># explain (costs off, verbose) select c1, c2, avg(c3) from t2 group by
>> >grouping sets((c1,c2), (c1));
>> >                          QUERY PLAN
>> >--------------------------------------------------------------
>> > Finalize GroupAggregate
>> >   Output: c1, c2, avg(c3), (gset_id)
>> >   Group Key: t2.c1, t2.c2, (gset_id)
>> >   ->  Gather Merge
>> >         Output: c1, c2, (gset_id), (PARTIAL avg(c3))
>> >         Workers Planned: 2
>> >         ->  Sort
>> >               Output: c1, c2, (gset_id), (PARTIAL avg(c3))
>> >               Sort Key: t2.c1, t2.c2, (gset_id)
>> >               ->  Partial HashAggregate
>> >                     Output: c1, c2, gset_id, PARTIAL avg(c3)
>> >                     Hash Key: t2.c1, t2.c2
>> >                     Hash Key: t2.c1
>> >                     ->  Parallel Seq Scan on public.t2
>> >                           Output: c1, c2, c3
>> >(15 rows)
>> >
>>
>> OK, I'm not sure I understand the point of this - can you give an
>> example which is supposed to benefit from this? Where does the speedup
>> come from?
>
>I think this is a bad example since the first grouping set is a
>superset of the 2nd. If those were independent and each grouping set
>produced a reasonable number of groups then it may be better to do it
>this way instead of grouping by all exprs in all grouping sets in the
>first phase, as is done by #1. To do #2 would require that we tag
>the aggregate states with the grouping set they belong to, which seems
>to be what gset_id is in Richard's output.
>

Aha! So if we have grouping sets (a,b) and (c,d), then with the first
approach we'd do partial aggregate on (a,b,c,d) - which may produce
quite a few distinct groups, making it inefficient. But with the second
approach, we'd do just (a,b) and (c,d) and mark the rows with gset_id.

Neat!
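To make the difference in group counts concrete, here is a small illustrative sketch (plain Python, not PostgreSQL internals; the column names and the gset_id tag are just assumptions mirroring Richard's plan output) comparing how many partial groups each strategy produces when the grouping sets are independent:

```python
from collections import defaultdict

# Toy data: four independent columns a, b, c, d (the digits of i),
# so every input row lands in its own (a, b, c, d) group -- the
# worst case David describes upthread.
rows = [(i % 10, (i // 10) % 10, (i // 100) % 10, (i // 1000) % 10)
        for i in range(10000)]

# Strategy #1: one partial aggregate keyed on the union of all
# grouping-set columns (a, b, c, d).
union_groups = defaultdict(int)
for a, b, c, d in rows:
    union_groups[(a, b, c, d)] += 1

# Strategy #2: aggregate each grouping set separately, tagging each
# group with a gset_id (0 for set (a, b), 1 for set (c, d)).
tagged_groups = defaultdict(int)
for a, b, c, d in rows:
    tagged_groups[(0, a, b)] += 1
    tagged_groups[(1, c, d)] += 1

print(len(union_groups))   # 10000 partial groups -- one per input row
print(len(tagged_groups))  # 200 partial groups (100 per grouping set)
```

With correlated sets (as in Richard's example, where (c1, c2) subsumes (c1)), the union key costs little extra, which is why the example above doesn't show the benefit; the payoff appears only when the sets are largely independent.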

>In my example upthread the first phase of aggregation produced a group
>per input row. Method #2 would work better for that case since it
>would only produce 2000 groups instead of 1 million.
>
>Likely both methods would be good to consider, but since #1 seems much
>easier than #2, then to me it seems to make sense to start there.
>

Yep. Thanks for the explanation.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
