Re: Push down Aggregates below joins

From: Antonin Houska <ah(at)cybertec(dot)at>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Push down Aggregates below joins
Date: 2018-08-03 14:50:11
Message-ID: 18751.1533307811@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Antonin Houska <ah(at)cybertec(dot)at> wrote:

> I didn't have enough time to separate "your functionality" and can do it when
> I'm back from vacation.

So I've separated the code that does not use the 2-stage replication (and
therefore the feature is not involved in parallel queries).

Note on coding: so far most of the functions to which I added the "grouped"
argument can get the same information from rel->agg_info (it's set iff the
relation is grouped). So we can remove it for this "simple aggregation", but
the next (more generic) part of the patch will have to add some argument
anyway, to indicate whether AGGSPLIT_SIMPLE, AGGSPLIT_INITIAL_SERIAL (or no
aggregation) should be performed.

Besides splitting the code I worked on the determination whether join
duplicates grouping keys or not. Currently it still fails to combine
"uniquekeys" of joined relation in some (important) cases when the information
needed is actually available. I think ECs should be used here, like it is for
pathkeys.

So currently you should comment out this code

if (!match_uniquekeys_to_group_pathkeys(root, result, target))
*keys_ok = false;

in pathnode.c:make_uniquekeys_for_join() if you want the patch to at leat
produce interesting EXPLAIN output.

One example:

CREATE TABLE a(i int primary key);
CREATE TABLE b(j int primary key, k int);

SET enable_agg_pushdown TO true;

EXPLAIN
SELECT j, sum(k)
FROM a, b
WHERE i = j
GROUP BY j

QUERY PLAN
-----------------------------------------------------------------------
Hash Join (cost=94.75..162.43 rows=2260 width=12)
Hash Cond: (a.i = b.j)
-> Seq Scan on a (cost=0.00..35.50 rows=2550 width=4)
-> Hash (cost=66.50..66.50 rows=2260 width=12)
-> HashAggregate (cost=43.90..66.50 rows=2260 width=12)
Group Key: b.j
-> Seq Scan on b (cost=0.00..32.60 rows=2260 width=8)

However there are cases like this

EXPLAIN
SELECT i, sum(k)
FROM a, b
WHERE i = j
GROUP BY i

which currently does not work. The reason is that the column b.j which is not
in the GROUP BY clause needs to be in the grouped output of the "b" (grouped)
table output, otherwise the join condition cannot be evaluated.

While separating the code that only uses 1-stage aggregation I removed the
code that adds such extra grouping keys to per-relation AggPath because in
general this is only safe if the final aggregation is performed, and the final
aggregation uses no added columns. However I forgot that grouping keys can be
added in cases like shown above, i.e. the grouping expression b.j is derived
from GROUP BY using equivalence class.

I'll fix this (and various other problems) asap. I believe it's worth to at
least show the current code. I'm curious if it's something we can build on or
if another rework will be needed.

(I'll be off next week.)

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com

Attachment Content-Type Size
agg_pushdown_simple.patch text/x-diff 231.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-08-03 15:21:38 Re: Fallout from PQhost() semantics changes
Previous Message Alvaro Herrera 2018-08-03 14:38:15 Re: Explain buffers wrong counter with parallel plans