Re: [HACKERS] WIP: Aggregation push-down

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Antonin Houska <ah(at)cybertec(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WIP: Aggregation push-down
Date: 2018-11-18 21:43:59
Lists: pgsql-hackers

Antonin Houska <ah(at)cybertec(dot)at> writes:
> [ agg_pushdown_v8.tgz ]

I spent a few hours looking at this today. It seems to me that at no
point has there been a clear explanation of what the patch is trying to
accomplish, so let me see whether I've got it straight or not. (I suggest
that something like this ought to be included in optimizer/README; the
patch's lack of internal documentation is a serious deficiency.)


Conceptually, we'd like to be able to push aggregation down below joins
when that yields a win, which it could do by reducing the number of rows
that have to be processed at the join stage. Thus consider

SELECT agg(a.x) FROM a, b WHERE a.y = b.z;

We can't simply aggregate during the scan of "a", because some values of
"x" might not appear at all in the input of the naively-computed aggregate
(if their rows have no join partner in "b"), or might appear multiple
times (if their rows have multiple join partners). So at first glance,
aggregating before the join is impossible. The key insight of the patch
is that we can make some progress by considering only grouped aggregation:
if we can group "a" into sets of rows that must all have the same join
partners, then we can do preliminary aggregation within each such group,
and take care of the number-of-repetitions problem when we join. If the
groups are large then this reduces the number of rows processed by the
join, at the cost that we might spend time computing the aggregate for
some row sets that will just be discarded by the join. So now we consider

SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?;

and ask what properties the grouping column(s) "?" must have to make this
work. I believe we can say that it's OK to push down through a join if
its join clauses are all of the form "a.y = b.z", where either a.y or b.z
is listed as a GROUP BY column *and* the join operator is equality for
the btree opfamily specified by the SortGroupClause. (Note: actually,
SortGroupClause per se mentions an equality operator, although I think the
planner quickly reduces that to an opfamily specification.) The concerns
Robert had about equality semantics are certainly vital, but I believe
that the logic goes through correctly as long as the grouping equality
operator and the join equality operator belong to the same opfamily.
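To make the intended transformation concrete, here is a toy simulation (in Python, with made-up data; none of it comes from the patch itself). It checks that grouping "a" on the join column, partially aggregating, and then adjusting each group's state by its number of join partners gives the same answer as the naive join-then-aggregate plan:

```python
from collections import defaultdict

# Toy tables; these rows are illustrative only.
a = [(1, 10), (2, 10), (3, 20), (4, 30)]   # (x, y)
b = [(10,), (10,), (20,)]                  # (z,) -- note y=30 has no partner

# Naive plan: join first, then aggregate
# (SELECT sum(a.x) FROM a, b WHERE a.y = b.z).
naive = sum(x for (x, y) in a for (z,) in b if y == z)

# Pushed-down plan: group "a" on the join column y and partially aggregate;
# each group's partial sum must then be repeated once per join partner.
partial = defaultdict(int)
for x, y in a:
    partial[y] += x                        # partial aggregation per group

pushed = 0
for y, psum in partial.items():
    n_partners = sum(1 for (z,) in b if z == y)
    pushed += psum * n_partners            # the repetition adjustment for SUM

assert naive == pushed
print(naive, pushed)                       # prints: 9 9
```

The group with y=30 is aggregated and then discarded (zero partners), which is exactly the wasted work the text mentions as the cost of pushing down.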

Case 1: the GROUP BY column is a.y. Two rows of "a" whose y values are
equal per the grouping operator must join to exactly the same set of "b"
rows, else transitivity is failing.

Case 2: the GROUP BY column is b.z. It still works, because the set of
"a" rows that are equal to a given z value is well-defined, and those
rows must join to exactly the "b" rows whose z entries are equal to
the given value, else transitivity is failing.

Robert postulated cases like "citext = text", but that doesn't apply
here because no cross-type citext = text operator could be part of a
well-behaved opfamily. What we'd actually be looking at is either
"citextvar::text texteq textvar" or "citextvar citexteq textvar::citext",
and the casts prevent these expressions from matching GROUP BY entries
that have the wrong semantics.

However: what we have proven here is only that we can aggregate across
a set of rows that must share the same join partners. We still have
to be able to handle the case that the rowset has more than one join
partner, which AFAICS means that we must use partial aggregation and
then apply an "aggmultifn" (or else apply the aggcombinefn N times).
We can avoid that and use plain aggregation when we can prove the "b"
side of the join is unique, so that no sets of rows will have to be merged
post-join; but ISTM that that reduces the set of use cases to be too small
to be worth such a complex patch. So I'm really doubtful that we should
proceed forward with only that case available.
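A sketch of the repetition problem, assuming an AVG-style partial state of (sum, count); the function names here are illustrative stand-ins for the catalog entries, not the patch's actual code:

```python
# Partial state for an AVG-like aggregate: (sum, count).
# combinefn merges two partial states componentwise.
def combinefn(s1, s2):
    return (s1[0] + s2[0], s1[1] + s2[1])

# Hypothetical "aggmultifn": absorb N repetitions of one state at once.
def multifn(state, n):
    return (state[0] * n, state[1] * n)

state = (15, 4)        # one row group of "a": partial sum 15 over 4 rows
n = 3                  # that group matched 3 join partners in "b"

# Merging N copies of the state by repeated combinefn calls...
acc = state
for _ in range(n - 1):
    acc = combinefn(acc, state)

# ...equals a single multifn call, but costs O(N) rather than O(1).
assert acc == multifn(state, n) == (45, 12)
```

This is why either an aggmultifn or N-fold application of the combinefn is needed whenever the "b" side is not provably unique.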

Also, Tomas complained in the earlier thread that he didn't think
grouping on the join column was a very common use-case in the first
place. I share that concern, but I think we could extend the logic
to the case that Tomas posited as being useful:

SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY b.z;

where the join column b.z is unique. If we group on a.y (using semantics
compatible with the join operator and the uniqueness constraint), then all
"a" rows in a given group will join to exactly one "b" row that
necessarily has exactly one grouping value, so this group can safely be
aggregated together. We might need to combine it post-join with other "b"
rows that have equal "z" values, but we can do that as long as we're okay
with partial aggregation. I think this example shows why the idea is far
more powerful with partial aggregation than without.
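A toy simulation of this case too (Python, made-up data; b.z is unique here): partially aggregate "a" grouped on a.y, join each partial state to its single "b" partner, and finalize per z value:

```python
from collections import defaultdict

# Toy tables; b.z is unique, as the example requires.
a = [(1, 10), (2, 10), (5, 20), (7, 20), (9, 30)]  # (x, y)
b = [(10,), (20,)]                                  # (z,)

# Reference plan: join first, then GROUP BY b.z.
expected = defaultdict(int)
for x, y in a:
    for (z,) in b:
        if y == z:
            expected[z] += x

# Pushed-down plan: partial aggregation of "a" grouped on a.y.
partial = defaultdict(int)
for x, y in a:
    partial[y] += x

# Join each partial state to its (at most one) "b" partner and finalize.
result = defaultdict(int)
for y, psum in partial.items():
    for (z,) in b:
        if y == z:                  # at most one match, since z is unique
            result[z] += psum       # merging partial states: addition for SUM

assert dict(result) == dict(expected)
print(dict(result))                 # prints: {10: 3, 20: 12}
```

The final merge step is where partial aggregation earns its keep: it lets distinct a.y groups that land on equal grouping values be combined safely after the join.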


In short, then, I don't have much use for the patch as presented in this
thread, without "aggmultifn". That might be OK as the first phase in a
multi-step patch, but not as a production feature. I also have no use
for the stuff depending on bitwise equality rather than the user-visible
operators; that's just a hack, and an ugly one.

As far as the patch details go, I've not looked through it in great
detail, but it appears to me that you are still trying to cram the same
stuff into base relations as before; shoving it into a subsidiary struct
doesn't improve that. Multiple people have said that's the wrong thing,
and I agree. Conceptually it's a disaster: a single RelOptInfo should
represent one well-defined set of result rows, not more than one. This
approach also cannot be extended to handle doing aggregations partway up
the join tree, which is at least theoretically interesting (though I'm
fine with not tackling it right away). I think the right representation
is to create UPPERREL_GROUP_AGG RelOptInfos whose relids show the set
of baserels joined before aggregating. Currently there's only one of
those and its relids is equal to all_baserels (or should be, anyway).
This patch would add instances of UPPERREL_GROUP_AGG RelOptInfos with
singleton relids, and later we might put in the ability to consider
aggregation across other relid subsets, and in any case we'd run join
planning on those level-one rels and create new UPPERREL_GROUP_AGG
RelOptInfos that represent the intermediate join steps leading up to
the final join. The correct indexing structure for this collection of
RelOptInfos is certainly not simple_rel_array; it should look more like
the way we index the various regular join rels that we consider. (Maybe
an appropriate preliminary patch is to refactor the support code
associated with join_rel_list + join_rel_hash so that we can treat those
fields as one instance of a structure we'll replicate later.) Note that
I'm envisioning that we'd basically run join planning a second time on
this tree of join rels, rather than try to merge it with the primary
join planning. Since the numbers of rows to be processed will be
completely different in this join tree, merging it with the standard one
seems hopeless.

BTW, I noticed some noise in the v8 patch: what's it doing touching
src/interfaces/libpq/fe-auth.c or src/interfaces/libpq/fe-secure-common.c?

regards, tom lane
