Re: [HACKERS] WIP: Aggregation push-down

From: Antonin Houska <ah(at)cybertec(dot)at>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WIP: Aggregation push-down
Date: 2019-01-15 14:06:18
Message-ID: 10529.1547561178@localhost
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

This is the next version. A few more comments below.

Antonin Houska <ah(at)cybertec(dot)at> wrote:

> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Antonin Houska <ah(at)cybertec(dot)at> writes:
> Thanks!
> > Conceptually, we'd like to be able to push aggregation down below joins
> > when that yields a win, which it could do by reducing the number of rows
> > that have to be processed at the join stage. Thus consider
> >
> > SELECT agg(a.x) FROM a, b WHERE a.y = b.z;
> >
> > We can't simply aggregate during the scan of "a", because some values of
> > "x" might not appear at all in the input of the naively-computed aggregate
> > (if their rows have no join partner in "b"), or might appear multiple
> > times (if their rows have multiple join partners). So at first glance,
> > aggregating before the join is impossible. The key insight of the patch
> > is that we can make some progress by considering only grouped aggregation:
> Yes, the patch does not handle queries with no GROUP BY clause. Primarily
> because I don't know how the grouped "a" table in this example could emit the
> "a.y" column w/o using it as a grouping key.
> > if we can group "a" into sets of rows that must all have the same join
> > partners, then we can do preliminary aggregation within each such group,
> > and take care of the number-of-repetitions problem when we join. If the
> > groups are large then this reduces the number of rows processed by the
> > join, at the cost that we might spend time computing the aggregate for
> > some row sets that will just be discarded by the join. So now we consider
> >
> > SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?;
> >
> > and ask what properties the grouping column(s) "?" must have to make this
> > work. I believe we can say that it's OK to push down through a join if
> > its join clauses are all of the form "a.y = b.z", where either a.y or b.z
> > is listed as a GROUP BY column *and* the join operator is equality for
> > the btree opfamily specified by the SortGroupClause.

I think the requirement for the join clauses to be of the form "a.y = b.z"
would only be valid if we wanted to avoid the 2-stage aggregation. For
example, in this query

SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY b.z

uniqueness of "b.z" ensures that groups resulting from the pushed-down
aggregation (SELECT a.y, agg(a.x) FROM a GROUP BY a.y) do not get duplicated
by the join. However if there's a consensus that the 2-stage aggregation will
be used (because avoidance of it would limit usefulness of the feature too
much), I think we can accept generic join clauses. (Note that the patch does
not try to push aggregate down below nullable side of outer join - handling of
generated NULL values by join clauses would be another problem.)

> > The concerns Robert had about equality semantics are certainly vital, but I
> > believe that the logic goes through correctly as long as the grouping
> > equality operator and the join equality operator belong to the same
> > opfamily.
> ok, generic approach like this is always better. I just could not devise it,
> nor can I prove that your theory is wrong.
> > Robert postulated cases like "citext = text", but that doesn't apply
> > here because no cross-type citext = text operator could be part of a
> > well-behaved opfamily.
> The reason I can see now is that the GROUP BY operator (opfamily) essentially
> has the grouping column on both sides, so it does not match the cross-type
> operator.

Even if my statement that the join clause can be of other kind than "<leftarg>
= <rightarg>" was right, it might still be worth insisting on "<leftargt> <op>
<rightarg>" where <op> is of the same operator family as the grouping

> > Also, Tomas complained in the earlier thread that he didn't think
> > grouping on the join column was a very common use-case in the first
> > place. I share that concern, but I think we could extend the logic
> > to the case that Tomas posited as being useful:
> >
> > SELECT agg(a.x) FROM a, b WHERE a.y = GROUP BY b.z;
> >
> > where the join column is unique. If we group on a.y (using semantics
> > compatible with the join operator and the uniqueness constraint), then all
> > "a" rows in a given group will join to exactly one "b" row that
> > necessarily has exactly one grouping value, so this group can safely be
> > aggregated together. We might need to combine it post-join with other "b"
> > rows that have equal "z" values, but we can do that as long as we're okay
> > with partial aggregation. I think this example shows why the idea is far
> > more powerful with partial aggregation than without.
> Good idea. I'll try to incorporate it in the next version.

The previous version of the patch (before I removed the partial aggregation)
could actually handle this case, so I restored it. It just does not check the
uniqueness of the column because: the final aggregation will eliminate
the duplicated groups anyway.

The patch views the problem from this perspective: "a.y" is needed by join
clause, and the only way to retrieve it from the partially aggregated relation
"a" is to use it as a grouping column. Such an "artificial" grouping
expression can only be used if the plan contains the final aggregate node,
which uses only the original GROUP BY clause.

> > I also have no use for the stuff depending on bitwise equality rather than
> > the user-visible operators; that's just a hack, and an ugly one.
> The purpose of this was to avoid aggregation push-down for data types like
> numeric, see the example
> SELECT one.a
> FROM one, two
> WHERE one.a = two.a AND scale(two.a) > 1
> in [2]. The problem is that aggregation can make some information lost, which
> is needed above by WHERE/ON clauses. I appreciate any suggestions how to
> address this problem.

The new patch version does not contain this hack, but the problem is still
open. Alternatively I think of a boolean flag to be added to pg_opfamily or
pg_opclass catalog, saying whether equality also means "bitwise equality". If
the value is false (the default) for the GROUP BY operator, no aggregate
push-down can take place.

> > I think the right representation is to create UPPERREL_GROUP_AGG
> > RelOptInfos whose relids show the set of baserels joined before
> > aggregating. Currently there's only one of those and its relids is equal
> > to all_baserels (or should be, anyway). This patch would add instances of
> > UPPERREL_GROUP_AGG RelOptInfos with singleton relids, and later we might
> > put in the ability to consider aggregation across other relid subsets, and
> > in any case we'd run join planning on those level-one rels and create new
> > UPPERREL_GROUP_AGG RelOptInfos that represent the intermediate join steps
> > leading up to the final join. The correct indexing structure for this
> > collection of RelOptInfos is certainly not simple_rel_array; it should
> > look more like the way we index the various regular join rels that we
> > consider.

> > (Maybe an appropriate preliminary patch is to refactor the support code
> > associated with join_rel_list + join_rel_hash so that we can treat those
> > fields as one instance of a structure we'll replicate later.)

> Do you mean that, instead of a single RelOptInfo, join_rel_list would contain
> a structure that points to two RelOptInfo structures, where one is the "plain"
> one and the other is the "grouped" one?

Eventually I thought that you propose a separate instance of join_rel_list
(e.g. join_rel_list_grouped) for the 2nd run of the joining process, however
that would mean that some functions need to receive RelOptInfos as arguments,
and *also* search in the the join_rel_list_grouped when joining grouped
relation to non-grouped one. So I decided to wrap both grouped and non-grouped
relation into a new structure (RelOptInfoSet). Thus all the relations to be
joined are passed as function arguments. make_join_rel() shows best how
instances of the new structure is used.

> [1]
> [2]

Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt

Attachment Content-Type Size
v09-001-Export_estimate_hashagg_tablesize.patch text/x-diff 4.5 KB
v09-002-Introduce_make_join_rel_common.patch text/x-diff 2.5 KB
v09-003-Introduce_estimate_join_rows.patch text/x-diff 2.0 KB
v09-004-Agg_pushdown_basic.patch text/x-diff 189.5 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message James Coleman 2019-01-15 14:33:47 Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
Previous Message Peter Eisentraut 2019-01-15 13:24:33 Re: using expression syntax for partition bounds