Quick Links

Push down Aggregates below joins

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Push down Aggregates below joins
Date:	2018-06-20 20:12:51
Message-ID:	c165b72e-8dbb-2a24-291f-113aeb67b76a@iki.fi
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Currently, the planner always first decides the scan/join order, and
adds Group/Agg nodes on top of the joins. Sometimes it would be legal,
and beneficial, to perform the aggregation below a join. I've been
hacking on a patch to allow that.

For example:

create temp table a (id int4 primary key);
create temp table b (id int4);

insert into a select g from generate_series(1, 1000) g;
insert into b select g/10 from generate_series(1, 10000) g;
analyze a,b;

explain select b.id from a, b where a.id = b.id group by b.id;

Currently, you get a plan like this:

QUERY PLAN
-----------------------------------------------------------------------
HashAggregate (cost=323.64..333.65 rows=1001 width=4)
Group Key: b.id
-> Hash Join (cost=27.50..298.66 rows=9990 width=4)
Hash Cond: (b.id = a.id)
-> Seq Scan on b (cost=0.00..145.00 rows=10000 width=4)
-> Hash (cost=15.00..15.00 rows=1000 width=4)
-> Seq Scan on a (cost=0.00..15.00 rows=1000 width=4)
(7 rows)

With the patch, you get a plan like this:

QUERY PLAN
-------------------------------------------------------------------------
Hash Join (cost=192.52..221.27 rows=9990 width=4)
Hash Cond: (a.id = b.id)
-> Seq Scan on a (cost=0.00..15.00 rows=1000 width=4)
-> Hash (cost=180.01..180.01 rows=1001 width=4)
-> HashAggregate (cost=170.00..180.01 rows=1001 width=4)
Group Key: b.id
-> Seq Scan on b (cost=0.00..145.00 rows=10000 width=4)
(7 rows)

This is faster, because you need to join fewer rows. (That's not
reflected in the cost estimates above; cost estimation is not done
properly in this prototype yet.)

Implementation
--------------

Move the responsibility of planning aggregates from the "upper stages",
in grouping_planner(), into scan/join planning, in query_planner().

In query_planner(), after building the RelOptInfo for a scan or join
rel, also build a grouped RelOptInfo to shadow each RelOptInfo (if
aggregation can be done at that rel). The grouped RelOptInfo is stored
in a new 'grouped_rel' field in the parent RelOptInfo.

A grouped rel holds Paths where the grouping/aggregation is already
performed at that node, or below it. For a base rel, it represents
performing the aggregation on top of the scan, i.e. the Paths are like
Agg(Scan). For a grouped join rel, the paths look like an Agg(Join(A,
B)), or Join(Agg(A), B).

The first three of the attached patches just move existing code around.
The fourth patch contains the actual feature.

This is still a rough prototype, but any thoughts on the general approach?

- Heikki

Attachment	Content-Type	Size
0001-Refactor-query_planner-into-two-parts.patch	text/x-patch	17.7 KB
0002-Move-GROUP-BY-planning-code-from-planner.c-to-separa.patch	text/x-patch	261.3 KB
0003-Move-find_em_expr_for_rel-into-the-backend.patch	text/x-patch	4.1 KB
0004-Plan-aggregation-in-query_planner-to-allow-aggregati.patch	text/x-patch	86.3 KB

Responses

Re: Push down Aggregates below joins at 2018-06-20 21:05:38 from Tomas Vondra
Re: Push down Aggregates below joins at 2018-06-20 21:05:55 from Tomas Vondra
Re: Push down Aggregates below joins at 2018-07-01 11:40:55 from Peter Eisentraut

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Mukesh Chhatani	2018-06-20 20:13:24	Re: Postgres 10.4 crashing when using PLV8
Previous Message	David G. Johnston	2018-06-20 20:02:19	Re: Postgres 10.4 crashing when using PLV8