Re: Parallel Aggregate

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Paul Ramsey <pramsey(at)cleverelephant(dot)ca>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Aggregate
Date: 2016-03-10 14:39:29
Message-ID: CAKJS1f82ELFkufBh-Ywt5rZYRtHkHWewQ=cfjczWu8LtQCMOww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 8 March 2016 at 11:15, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> The setrefs.c parts of the patch remain completely broken. I've not
> had time to look at this again yet, sorry.

Ok, so again, apologies for previously sending such a broken patch.
I've since managed to free up a bit of time to work on this, which now
consists of a good bit more than the sum total of my weekly lunch
hours.

The attached patch is a much more complete patch, and quite possible
now review worthy.

I set about solving the setrefs.c problem by inventing a PartialAggref
node type which wraps up Aggrefs when they're in a agg node which has
finalizeAggs = false. This node type exists all the way until executor
init time, where it's then removed and replaced with the underlying
Aggref. This seems to solve the targetlist return type issue.
I'd really like some feedback on this method of solving that problem.

I've also fixed numerous other bugs, including the HAVING clause problem.

Things left to do:
1. Make a final pass of the patch not at 3am.
2. Write some tests.
3. I've probably missed a few places that should handle T_PartialAggref.

A couple of things which I'm not 100% happy with.

1. make_partialgroup_input_target() doing lookups to the syscache.
Perhaps this job can be offloaded to a new function in a more suitable
location. Ideally the Aggref would already store the required
information, but I don't see a great place to be looking that up.
2. I don't really like how I had to add tlist to
create_grouping_paths(), but I didn't want to go to the trouble of
calculating the partial agg PathTarget if Parallel Aggregation is not
possible, as this does do syscache lookups, so it's not free, so I'd
rather only do it if we're actually going to add some parallel paths.
3. Something about the force_parallel_mode GUC. I'll think about this
when I start to think about how to test this, as likely I'll need
that, else I'd have to create tables bigger than we'd want to in the
regression tests.

I've also attached an .sql file with an aggregate function aimed to
test the new PartialAggref stuff works properly, as previously it
seemed to work by accident with sum(int).

Just some numbers to maybe make this more interesting:

create table t (num int not null, one int not null, ten int not null,
hundred int not null, thousand int not null, tenk int not null,
hundredk int not null, million int not null);
insert into t select
x,1,x%10+1,x%100+1,x%1000+1,x%10000+1,x%100000+1,x%1000000 from
generate_series(1,10000000)x(x);

-- Serial Plan
# explain select sum(num) from t;
QUERY PLAN
-------------------------------------------------------------------
Aggregate (cost=198530.00..198530.01 rows=1 width=8)
-> Seq Scan on t (cost=0.00..173530.00 rows=10000000 width=4)

# select sum(num) from t;
sum
----------------
50000005000000
(1 row)
Time: 1036.119 ms

# set max_parallel_degree=4;

-- Parallel Plan
# explain select sum(num) from t;
QUERY PLAN
--------------------------------------------------------------------------------------
Finalize Aggregate (cost=105780.52..105780.53 rows=1 width=8)
-> Gather (cost=105780.00..105780.51 rows=5 width=8)
Number of Workers: 4
-> Partial Aggregate (cost=104780.00..104780.01 rows=1 width=8)
-> Parallel Seq Scan on t (cost=0.00..98530.00
rows=2500000 width=4)
(5 rows)

# select sum(num) from t;
sum
----------------
50000005000000
(1 row)

Time: 379.117 ms

I'll try and throw a bit parallel aggregate work to a 4 socket / 64
core server which I have access to... just for fun.

Reviews are now welcome.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
parallel_aggregation_cc3763e_2016-03-11.patch application/octet-stream 41.3 KB
parallel_aggregation_test.sql application/octet-stream 2.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-03-10 14:43:57 Re: Floating point timestamps
Previous Message Alvaro Herrera 2016-03-10 14:36:31 Re: utils/misc: Simplify search of end of argv in save_ps_display_args()