Re: Parallel grouping sets

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Richard Guo <riguo(at)pivotal(dot)io>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel grouping sets
Date: 2019-06-14 00:44:38
Message-ID: 20190614004438.e2y7ubdnpobq5t37@development
Lists: pgsql-hackers

On Fri, Jun 14, 2019 at 12:02:52PM +1200, David Rowley wrote:
>On Fri, 14 Jun 2019 at 11:45, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Wed, Jun 12, 2019 at 10:58:44AM +0800, Richard Guo wrote:
>
>> ># explain (costs off, verbose) select c1, c2, avg(c3) from t2 group by
>> >grouping sets((c1,c2), (c1));
>> >                          QUERY PLAN
>> >--------------------------------------------------------------
>> > Finalize GroupAggregate
>> >   Output: c1, c2, avg(c3), (gset_id)
>> >   Group Key: t2.c1, t2.c2, (gset_id)
>> >   ->  Gather Merge
>> >         Output: c1, c2, (gset_id), (PARTIAL avg(c3))
>> >         Workers Planned: 2
>> >         ->  Sort
>> >               Output: c1, c2, (gset_id), (PARTIAL avg(c3))
>> >               Sort Key: t2.c1, t2.c2, (gset_id)
>> >               ->  Partial HashAggregate
>> >                     Output: c1, c2, gset_id, PARTIAL avg(c3)
>> >                     Hash Key: t2.c1, t2.c2
>> >                     Hash Key: t2.c1
>> >                     ->  Parallel Seq Scan on public.t2
>> >                           Output: c1, c2, c3
>> >(15 rows)
>> >
>>
>> OK, I'm not sure I understand the point of this - can you give an
>> example which is supposed to benefit from this? Where does the speedup
>> come from?
>
>I think this is a bad example since the first grouping set is a
>superset of the 2nd. If those were independent and each grouping set
>produced a reasonable number of groups then it may be better to do it
>this way instead of grouping by all exprs in all grouping sets in the
>first phase, as is done by #1. To do #2 would require that we tag
>the aggregate states with the grouping set they belong to, which seems
>to be what gset_id is in Richard's output.
>

Aha! So if we have grouping sets (a,b) and (c,d), then with the first
approach we'd do partial aggregate on (a,b,c,d) - which may produce
quite a few distinct groups, making it inefficient. But with the second
approach, we'd do just (a,b) and (c,d) and mark the rows with gset_id.

Neat!
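To make the difference in group counts concrete, here is a small illustrative sketch (plain Python, not PostgreSQL internals; the column names and the gset_id tag are just assumptions mirroring Richard's plan output) comparing how many partial groups each strategy produces when the grouping sets are independent:

```python
from collections import defaultdict

# Toy data: four independent columns a, b, c, d (the digits of i),
# so every input row lands in its own (a, b, c, d) group -- the
# worst case David describes upthread.
rows = [(i % 10, (i // 10) % 10, (i // 100) % 10, (i // 1000) % 10)
        for i in range(10000)]

# Strategy #1: one partial aggregate keyed on the union of all
# grouping-set columns (a, b, c, d).
union_groups = defaultdict(int)
for a, b, c, d in rows:
    union_groups[(a, b, c, d)] += 1

# Strategy #2: aggregate each grouping set separately, tagging each
# group with a gset_id (0 for set (a, b), 1 for set (c, d)).
tagged_groups = defaultdict(int)
for a, b, c, d in rows:
    tagged_groups[(0, a, b)] += 1
    tagged_groups[(1, c, d)] += 1

print(len(union_groups))   # 10000 partial groups -- one per input row
print(len(tagged_groups))  # 200 partial groups (100 per grouping set)
```

With correlated sets (as in Richard's example, where (c1, c2) subsumes (c1)), the union key costs little extra, which is why the example above doesn't show the benefit; the payoff appears only when the sets are largely independent.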

>In my example upthread the first phase of aggregation produced a group
>per input row. Method #2 would work better for that case since it
>would only produce 2000 groups instead of 1 million.
>
>Likely both methods would be good to consider, but since #1 seems much
>easier than #2, then to me it seems to make sense to start there.
>

Yep. Thanks for the explanation.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
