Re: Parallel grouping sets

From: Richard Guo <riguo(at)pivotal(dot)io>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel grouping sets
Date: 2019-06-13 10:24:04
Message-ID: CAN_9JTzmdKFCTOXWcLDJQpiYLCTH9K8jJiiOv7t+wDfc9Wo5qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jun 13, 2019 at 12:29 PM David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
wrote:

> On Wed, 12 Jun 2019 at 14:59, Richard Guo <riguo(at)pivotal(dot)io> wrote:
> > Implementation 1
>
> > Parallel aggregation has already been supported in PostgreSQL and it is
> > implemented by aggregating in two stages. First, each worker performs an
> > aggregation step, producing a partial result for each group of which
> > that process is aware. Second, the partial results are transferred to
> > the leader via the Gather node. Finally, the leader merges the partial
> > results and produces the final result for each group.
> >
> > We are implementing parallel grouping sets in the same way. The only
> > difference is that in the final stage, the leader performs a grouping
> > sets aggregation, rather than a normal aggregation.
>
> Hi Richard,
>
> I think it was you an I that discussed #1 at unconference at PGCon 2
> weeks ago. The good thing about #1 is that it can be implemented as
> planner-only changes just by adding some additional paths and some
> costing. #2 will be useful when we're unable to reduce the number of
> inputs to the final aggregate node by doing the initial grouping.
> However, since #1 is easier, then I'd suggest going with it first,
> since it's the path of least resistance. #1 should be fine as long as
> you properly cost the parallel agg and don't choose it when the number
> of groups going into the final agg isn't reduced by the partial agg
> node. Which brings me to:
>

Hi David,

Yes. Thank you for the discussion at PGCon. I learned a lot from that.
And glad to meet you here. :)

I agree with you on going with #1 first.

>
> You'll need to do further work with the dNumGroups value. Since you're
> grouping by all the columns/exprs in the grouping sets you'll need the
> number of groups to be an estimate of that.
>

Exactly. The v1 patch estimates number of partial groups incorrectly, as
it calculates the number of groups for each grouping set and then add
them for dNumPartialPartialGroups, while we actually should calculate
the number of groups for all the columns in the grouping sets. I have
fixed this issue in v2 patch.

>
> Here's a quick test I did that shows the problem:
>
> create table abc(a int, b int, c int);
> insert into abc select a,b,1 from generate_Series(1,1000)
> a,generate_Series(1,1000) b;
> create statistics abc_a_b_stats (ndistinct) on a,b from abc;
> analyze abc;
>
> -- Here the Partial HashAggregate really should estimate that there
> will be 1 million rows.
> explain analyze select a,b,sum(c) from abc group by grouping sets
> ((a),(b));
> QUERY PLAN
>
> ---------------------------------------------------------------------------------------------------------------------------------------
> Finalize HashAggregate (cost=14137.67..14177.67 rows=2000 width=16)
> (actual time=1482.746..1483.203 rows=2000 loops=1)
> Hash Key: a
> Hash Key: b
> -> Gather (cost=13697.67..14117.67 rows=4000 width=16) (actual
> time=442.140..765.931 rows=1000000 loops=1)
> Workers Planned: 2
> Workers Launched: 2
> -> Partial HashAggregate (cost=12697.67..12717.67 rows=2000
> width=16) (actual time=402.917..526.045 rows=333333 loops=3)
> Group Key: a, b
> -> Parallel Seq Scan on abc (cost=0.00..9572.67
> rows=416667 width=12) (actual time=0.036..50.275 rows=333333 loops=3)
> Planning Time: 0.140 ms
> Execution Time: 1489.734 ms
> (11 rows)
>
> but really, likely the parallel plan should not be chosen in this case
> since we're not really reducing the number of groups going into the
> finalize aggregate node. That'll need to be factored into the costing
> so that we don't choose the parallel plan when we're not going to
> reduce the work in the finalize aggregate node. I'm unsure exactly how
> that'll look. Logically, I think the choice parallelize or not to
> parallelize needs to be if (cost_partial_agg + cost_gather +
> cost_final_agg < cost_agg) { do it in parallel } else { do it in
> serial }. If you build both a serial and parallel set of paths then
> you should see which one is cheaper without actually constructing an
> "if" test like the one above.
>

Both the serial and parallel set of paths would be built and the cheaper
one will be selected. So we don't need the 'if' test.

With v2 patch, the parallel plan will not be chosen for the above query:

# explain analyze select a,b,sum(c) from abc group by grouping sets
((a),(b));
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=20406.00..25426.00 rows=2000 width=16) (actual
time=935.048..935.697 rows=2000 loops=1)
Hash Key: a
Hash Key: b
-> Seq Scan on abc (cost=0.00..15406.00 rows=1000000 width=12) (actual
time=0.041..170.906 rows=1000000 loops=1)
Planning Time: 0.240 ms
Execution Time: 935.978 ms
(6 rows)

>
> Here's a simple group by with the same group by clause items as you
> have in the plan above that does get the estimated number of groups
> perfectly. The plan above should have the same estimate.
>
> explain analyze select a,b,sum(c) from abc group by a,b;
> QUERY PLAN
>
> ----------------------------------------------------------------------------------------------------------------------------
> GroupAggregate (cost=132154.34..152154.34 rows=1000000 width=16)
> (actual time=404.304..1383.343 rows=1000000 loops=1)
> Group Key: a, b
> -> Sort (cost=132154.34..134654.34 rows=1000000 width=12) (actual
> time=404.291..620.774 rows=1000000 loops=1)
> Sort Key: a, b
> Sort Method: external merge Disk: 21584kB
> -> Seq Scan on abc (cost=0.00..15406.00 rows=1000000
> width=12) (actual time=0.017..100.299 rows=1000000 loops=1)
> Planning Time: 0.115 ms
> Execution Time: 1412.034 ms
> (8 rows)
>
> Also, in the tests:
>
> > insert into gstest select 1,10,100 from generate_series(1,1000000)i;
> > insert into gstest select 1,10,200 from generate_series(1,1000000)i;
> > insert into gstest select 1,20,30 from generate_series(1,1000000)i;
> > insert into gstest select 2,30,40 from generate_series(1,1000000)i;
> > insert into gstest select 2,40,50 from generate_series(1,1000000)i;
> > insert into gstest select 3,50,60 from generate_series(1,1000000)i;
> > insert into gstest select 1,NULL,000000 from generate_series(1,1000000)i;
> > analyze gstest;
>
> You'll likely want to reduce the number of rows being used just to
> stop the regression tests becoming slow on older machines. I think
> some of the other parallel aggregate tests use must fewer rows than
> what you're using there. You might be able to use the standard set of
> regression test tables too, tenk, tenk1 etc. That'll save the test
> having to build and populate one of its own.
>

Yes, that makes sense. Table size has been reduced in v2 patch.
Currently I do not use the standard regression test tables as I'd like
to customize the table with some specific data for correctness
verification. But we may switch to the standard test table later.

Also in v2 patch, I'v fixed two addition issues. One is about the sort
key for sort-based grouping sets in Partial Aggregate, which should be
all the columns in parse->groupClause. The other one is about
GroupingFunc. Since Partial Aggregate will not handle multiple grouping
sets at once, it does not need to evaluate GroupingFunc. So GroupingFunc
is removed from the targetlists of Partial Aggregate.

Thanks
Richard

Attachment Content-Type Size
v2-0001-Implementing-parallel-grouping-sets.patch application/octet-stream 21.7 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2019-06-13 11:01:22 Re: [PATCH] Speedup truncates of relation forks
Previous Message Pavel Trukhanov 2019-06-13 10:14:23 Improve handling of pg_stat_statements handling of bind "IN" variables