|Antonin Houska <ah(at)cybertec(dot)at>
|Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
|Re: [HACKERS] WIP: Aggregation push-down
|Raw Message | Whole Thread | Download mbox | Resend email
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Antonin Houska <ah(at)cybertec(dot)at> writes:
> > [ agg_pushdown_v8.tgz ]
> I spent a few hours looking at this today.
> It seems to me that at no point has there been a clear explanation of what
> the patch is trying to accomplish, so let me see whether I've got it
> straight or not. (I suggest that something like this ought to be included
> in optimizer/README; the patch's lack of internal documentation is a serious
Earlier version of the patch did update optimizer/README but I forgot to merge
it into the current one. I'll fix that in the next version. (Also some more
will be added, especially header comments of some functions.)
> Conceptually, we'd like to be able to push aggregation down below joins
> when that yields a win, which it could do by reducing the number of rows
> that have to be processed at the join stage. Thus consider
> SELECT agg(a.x) FROM a, b WHERE a.y = b.z;
> We can't simply aggregate during the scan of "a", because some values of
> "x" might not appear at all in the input of the naively-computed aggregate
> (if their rows have no join partner in "b"), or might appear multiple
> times (if their rows have multiple join partners). So at first glance,
> aggregating before the join is impossible. The key insight of the patch
> is that we can make some progress by considering only grouped aggregation:
Yes, the patch does not handle queries with no GROUP BY clause. Primarily
because I don't know how the grouped "a" table in this example could emit the
"a.y" column w/o using it as a grouping key.
> if we can group "a" into sets of rows that must all have the same join
> partners, then we can do preliminary aggregation within each such group,
> and take care of the number-of-repetitions problem when we join. If the
> groups are large then this reduces the number of rows processed by the
> join, at the cost that we might spend time computing the aggregate for
> some row sets that will just be discarded by the join. So now we consider
> SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?;
> and ask what properties the grouping column(s) "?" must have to make this
> work. I believe we can say that it's OK to push down through a join if
> its join clauses are all of the form "a.y = b.z", where either a.y or b.z
> is listed as a GROUP BY column *and* the join operator is equality for
> the btree opfamily specified by the SortGroupClause. (Note: actually,
> SortGroupClause per se mentions an equality operator, although I think the
> planner quickly reduces that to an opfamily specification.)
I suppose you mean make_pathkey_from_sortop().
> The concerns Robert had about equality semantics are certainly vital, but I
> believe that the logic goes through correctly as long as the grouping
> equality operator and the join equality operator belong to the same
ok, generic approach like this is always better. I just could not devise it,
nor can I prove that your theory is wrong.
> Robert postulated cases like "citext = text", but that doesn't apply
> here because no cross-type citext = text operator could be part of a
> well-behaved opfamily.
The reason I can see now is that the GROUP BY operator (opfamily) essentially
has the grouping column on both sides, so it does not match the cross-type
> What we'd actually be looking at is either "citextvar::text texteq textvar"
> or "citextvar citexteq textvar::citext", and the casts prevent these
> expressions from matching GROUP BY entries that have the wrong semantics.
Can you please explain this? If the expression is "citextvar::text texteq
textvar", then both operands of the operator should be of "text" type (because
the operator receives the output of the cast), so I'd expect a match to
SortGroupClause.eqop of clause like "GROUP BY <text expression>".
> However: what we have proven here is only that we can aggregate across
> a set of rows that must share the same join partners. We still have
> to be able to handle the case that the rowset has more than one join
> partner, which AFAICS means that we must use partial aggregation and
> then apply an "aggmultifn" (or else apply the aggcombinefn N times).
> We can avoid that and use plain aggregation when we can prove the "b"
> side of the join is unique, so that no sets of rows will have to be merged
> post-join; but ISTM that that reduces the set of use cases to be too small
> to be worth such a complex patch. So I'm really doubtful that we should
> proceed forward with only that case available.
I tried to follow Heikki's proposal in  and separated the functionality
that does not need the partial aggregation. I was not able to foresee that the
patch does not get much smaller this way. Also because I had to introduce
functions that check whether particular join does duplicate aggregated values
or not. (Even if we use partial aggregation, we can use this functionality to
detect that we only need to call aggfinalfn() in some cases, as opposed to
setting up regular Agg node for the final aggregation. However that's an
optimization that can probably be moved to later position in the patch
I'll re-introduce the parallel aggregation in the next patch version.
> Also, Tomas complained in the earlier thread that he didn't think
> grouping on the join column was a very common use-case in the first
> place. I share that concern, but I think we could extend the logic
> to the case that Tomas posited as being useful:
> SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z;
> where the join column b.id is unique. If we group on a.y (using semantics
> compatible with the join operator and the uniqueness constraint), then all
> "a" rows in a given group will join to exactly one "b" row that
> necessarily has exactly one grouping value, so this group can safely be
> aggregated together. We might need to combine it post-join with other "b"
> rows that have equal "z" values, but we can do that as long as we're okay
> with partial aggregation. I think this example shows why the idea is far
> more powerful with partial aggregation than without.
Good idea. I'll try to incorporate it in the next version.
> In short, then, I don't have much use for the patch as presented in this
> thread, without "aggmultifn". That might be OK as the first phase in a
> multi-step patch, but not as a production feature.
The version 8 actually implements only part of the functionality that earlier
versions did. I wanted to have more feedback from the community on the concept
before I rework the whole series again. So for version 9 I'm going to include
the partial aggregation. On the other hand, support of things like parallel
processing, append relation or postgres_fdw doesn't seem to be necessary for
the inital version of the feature.
> I also have no use for the stuff depending on bitwise equality rather than
> the user-visible operators; that's just a hack, and an ugly one.
The purpose of this was to avoid aggregation push-down for data types like
numeric, see the example
FROM one, two
WHERE one.a = two.a AND scale(two.a) > 1
GROUP BY 1
in . The problem is that aggregation can make some information lost, which
is needed above by WHERE/ON clauses. I appreciate any suggestions how to
address this problem.
> As far as the patch details go, I've not looked through it in great
> detail, but it appears to me that you are still trying to cram the same
> stuff into base relations as before; shoving it into a subsidiary struct
> doesn't improve that. Multiple people have said that's the wrong thing,
> and I agree. Conceptually it's a disaster: a single RelOptInfo should
> represent one well-defined set of result rows, not more than one.
The first version of my patch did what you reject here, but then I started to
use a separate RelOptInfo for the grouped relations. First I introduced
simple_grouped_rel_array, but then (again per Heikki's suggestion) made the
"plain" RelOptInfo point to the grouped one.
> I think the right representation is to create UPPERREL_GROUP_AGG RelOptInfos
(Just a detail: since UPPERREL_PARTIAL_GROUP_AGG was added at some point, I
wonder if this is more appropriate than UPPERREL_GROUP_AGG for base relations
and joins. create_grouping_paths() can then add the paths to
> whose relids show the set of baserels joined before aggregating. Currently
> there's only one of those and its relids is equal to all_baserels (or should
> be, anyway). This patch would add instances of UPPERREL_GROUP_AGG
> RelOptInfos with singleton relids, and later we might put in the ability to
> consider aggregation across other relid subsets, and in any case we'd run
> join planning on those level-one rels and create new UPPERREL_GROUP_AGG
> RelOptInfos that represent the intermediate join steps leading up to the
> final join. The correct indexing structure for this collection of
> RelOptInfos is certainly not simple_rel_array; it should look more like the
> way we index the various regular join rels that we consider.
Nothing like simple_rel_array (e.g. simple_grouped_rel_array) even for the
base relations? I think the fact that the extra base relations are grouped
should not affect the way they are retrieved.
> (Maybe an appropriate preliminary patch is to refactor the support code
> associated with join_rel_list + join_rel_hash so that we can treat those
> fields as one instance of a structure we'll replicate later.)
Do you mean that, instead of a single RelOptInfo, join_rel_list would contain
a structure that points to two RelOptInfo structures, where one is the "plain"
one and the other is the "grouped" one?
> Note that I'm envisioning that we'd basically run join planning a second
> time on this tree of join rels, rather than try to merge it with the primary
> join planning. Since the numbers of rows to be processed will be completely
> different in this join tree, merging it with the standard one seems
A separate run of make_join_rel() for the grouped relations is what I tried to
do so far.
> BTW, I noticed some noise in the v8 patch: what's it doing touching
> src/interfaces/libpq/fe-auth.c or src/interfaces/libpq/fe-secure-common.c?
It's definitely just a noise. Something went wrong when I was extracting the
diff from my private repository.
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
|Re: Referential Integrity Checks with Statement-level Triggers
|Re: Pluggable Storage - Andres's take