Eager aggregation, take 3

From: Richard Guo <guofenglinux(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Eager aggregation, take 3
Date: 2024-03-04 08:27:24
Message-ID: CAMbWs48jzLrPt1J_00ZcPZXWUQKawQOFE8ROc-ADiYqsqrpBNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi All,

Eager aggregation is a query optimization technique that partially
pushes a group-by past a join, and finalizes it once all the relations
are joined. Eager aggregation reduces the number of input rows to the
join and thus may result in a better overall plan. This technique is
thoroughly described in the 'Eager Aggregation and Lazy Aggregation'
paper [1].

Back in 2017, a patch set has been proposed by Antonin Houska to
implement eager aggregation in thread [2]. However, it was at last
withdrawn after entering the pattern of "please rebase thx" followed by
rebasing and getting no feedback until "please rebase again thx". A
second attempt in 2022 unfortunately fell into the same pattern about
one year ago and was eventually closed again [3].

That patch set has included most of the necessary concepts to implement
eager aggregation. However, as far as I can see, it has several weak
points that we need to address. It introduces invasive changes to some
core planner functions, such as make_join_rel(). And with such changes
join_is_legal() would be performed three times for the same proposed
join, which is not great. Another weak point is that the complexity of
join searching dramatically increases with the growing number of
relations to be joined. This occurs because when we generate partially
aggregated paths, each path of the input relation is considered as an
input path for the grouped paths. As a result, the number of grouped
paths we generate increases exponentially, leading to a significant
explosion in computational complexity. Other weak points include the
lack of support for outer joins and partitionwise joins. And during my
review of the code, I came across several bugs (planning error or crash)
that need to be addressed.

I'd like to give it another take to implement eager aggregation, while
borrowing lots of concepts and many chunks of codes from the previous
patch set. Please see attached. I have chosen to use the term 'Eager
Aggregation' from the paper [1] instead of 'Aggregation push-down', to
differentiate the aggregation push-down technique in FDW.

The patch has been split into small pieces to make the review easier.

0001 introduces the RelInfoList structure, which encapsulates both a
list and a hash table, so that we can leverage the hash table for faster
lookups not only for join relations but also for upper relations. With
eager aggregation, it is possible that we generate so many upper rels of
type UPPERREL_PARTIAL_GROUP_AGG that a hash table can help a lot with
lookups.

0002 introduces the RelAggInfo structure to store information needed to
create grouped paths for base and join rels. It also revises the
RelInfoList related structures and functions so that they can be used
with RelAggInfos.

0003 checks if eager aggregation is applicable, and if so, collects
suitable aggregate expressions and grouping expressions in the query,
and records them in root->agg_clause_list and root->group_expr_list
respectively.

0004 implements the functions that check if eager aggregation is
applicable for a given relation, and if so, create RelAggInfo structure
for the relation, using the infos about aggregate expressions and
grouping expressions we collected earlier. In this patch, when we check
if a target expression can act as grouping expression, we need to check
if this expression can be known equal to other expressions due to ECs
that can act as grouping expressions. This patch leverages function
exprs_known_equal() to achieve that, after enhancing this function to
consider opfamily if provided.

0005 implements the functions that generate paths for grouped relations
by adding sorted and hashed partial aggregation paths on top of paths of
the plain base or join relations. For sorted partial aggregation paths,
we only consider any suitably-sorted input paths as well as sorting the
cheapest-total path. For hashed partial aggregation paths, we only
consider the cheapest-total path as input. By not considering other
paths we can reduce the number of grouping paths as much as possible
while still achieving reasonable results.

0006 builds grouped relations for each base relation if possible, and
generates aggregation paths for the grouped base relations.

0007 builds grouped relations for each just-processed join relation if
possible, and generates aggregation paths for the grouped join
relations. The changes made to make_join_rel() are relatively minor,
with the addition of a new function make_grouped_join_rel(), which finds
or creates a grouped relation for the just-processed joinrel, and
generates grouped paths by joining a grouped input relation with a
non-grouped input relation.

The other way to generate grouped paths is by adding sorted and hashed
partial aggregation paths on top of paths of the joinrel. This occurs
in standard_join_search(), after we've run set_cheapest() for the
joinrel. The reason for performing this step after set_cheapest() is
that we need to know the joinrel's cheapest paths (see 0005).

This patch also makes the grouped relation for the topmost join rel act
as the upper rel representing the result of partial aggregation, so that
we can add the final aggregation on top of that. Additionally, this
patch extends the functionality of eager aggregation to work with
partitionwise join and geqo.

This patch also makes eager aggregation work with outer joins. With
outer join, the aggregate cannot be pushed down if any column referenced
by grouping expressions or aggregate functions is nullable by an outer
join above the relation to which we want to apply the partiall
aggregation. Thanks to Tom's outer-join-aware-Var infrastructure, we
can easily identify such situations and subsequently refrain from
pushing down the aggregates.

Starting from this patch, you should be able to see plans with eager
aggregation.

0008 adds test cases for eager aggregation.

0009 adds a section in README that describes this feature (copied from
previous patch set, with minor tweaks).

Thoughts and comments are welcome.

[1] https://www.vldb.org/conf/1995/P345.PDF
[2] https://www.postgresql.org/message-id/flat/9666.1491295317%40localhost
[3]
https://www.postgresql.org/message-id/flat/OS3PR01MB66609589B896FBDE190209F495EE9%40OS3PR01MB6660.jpnprd01.prod.outlook.com

Thanks
Richard

Attachment Content-Type Size
v1-0001-Introduce-RelInfoList-structure.patch application/octet-stream 14.3 KB
v1-0005-Implement-functions-that-generate-paths-for-grouped-relations.patch application/octet-stream 13.1 KB
v1-0004-Implement-functions-that-create-RelAggInfos-if-applicable.patch application/octet-stream 26.1 KB
v1-0002-Introduce-RelAggInfo-structure-to-store-info-for-grouped-paths.patch application/octet-stream 7.8 KB
v1-0003-Set-up-for-eager-aggregation-by-collecting-needed-infos.patch application/octet-stream 14.3 KB
v1-0006-Build-grouped-relations-out-of-base-relations.patch application/octet-stream 9.0 KB
v1-0007-Build-grouped-relations-out-of-join-relations.patch application/octet-stream 19.3 KB
v1-0009-Add-README.patch application/octet-stream 4.8 KB
v1-0008-Add-test-cases.patch application/octet-stream 66.8 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey M. Borodin 2024-03-04 08:31:37 Re: Comments on Custom RMGRs
Previous Message Andres Freund 2024-03-04 08:20:21 Re: Avoid stack frame setup in performance critical routines using tail calls