Quick Links

Re: Eager aggregation, take 3

From:	"Matheus Alcantara" <matheusssilv97(at)gmail(dot)com>
To:	"Richard Guo" <guofenglinux(at)gmail(dot)com>
Cc:	"Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Tender Wang" <tndrwang(at)gmail(dot)com>, "Paul George" <p(dot)a(dot)george19(at)gmail(dot)com>, "Andy Fan" <zhihuifan1213(at)163(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Eager aggregation, take 3
Date:	2025-08-14 19:22:49
Message-ID:	DC2E8YU1D3M8.3M8S05QFY4XAR@gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 08/08/25 22:32, Richard Guo wrote:
>> It sounds like a good way to go for me, looking forward to the next
>> patch version to perform some other tests.
>
> OK. Here it is.
>
Thanks! I can confirm now that I can see the eager aggregate in action
in some of these queries that I've tested on the TPC-DS benchmark.

I few questions regarding the new version:

I've noticed that when a query has a WHERE clause filtering columns from
the same relation being aggregated using "=" operator the Partial and
Finalize aggregation nodes are not present on explain results even if
setup_eager_aggregation() returns true on all if statements and also
RelAggInfo->agg_useful is true. For example, consider this query that is
used on eager aggregation paper that use some tables from TPC-H
benchmark:

tpch=# show enable_eager_aggregate ;
enable_eager_aggregate
------------------------
on
(1 row)

tpch=# set max_parallel_workers_per_gather to 0;
SET

tpch=# EXPLAIN(COSTS OFF) SELECT O_CLERK,
SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)) AS LOSS
FROM LINEITEM
JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY
WHERE L_RETURNFLAG = 'R'
GROUP BY O_CLERK;
QUERY PLAN
--------------------------------------------------------------
HashAggregate
Group Key: orders.o_clerk
-> Hash Join
Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
-> Seq Scan on lineitem
Filter: (l_returnflag = 'R'::bpchar)
-> Hash
-> Seq Scan on orders
(8 rows)

Debugging this query shows that all if conditions on
setup_eager_aggregation() returns false and create_agg_clause_infos()
and create_grouping_expr_infos() are called. The RelAggInfo->agg_useful
is also being set to true so I would expect to see Finalize and Partial
agg nodes, is this correct or am I missing something here?

Removing the WHERE clause I can see the Finalize and Partial agg nodes:

tpch=# EXPLAIN(COSTS OFF) SELECT O_CLERK,
SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)) AS LOSS
FROM LINEITEM
JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY
GROUP BY O_CLERK;
QUERY PLAN
----------------------------------------------------------------------
Finalize HashAggregate
Group Key: orders.o_clerk
-> Merge Join
Merge Cond: (lineitem.l_orderkey = orders.o_orderkey)
-> Partial GroupAggregate
Group Key: lineitem.l_orderkey
-> Index Scan using idx_lineitem_orderkey on lineitem
-> Index Scan using orders_pkey on orders
(8 rows)

This can also be reproduced with an addition of a WHERE clause on some
tests on eager_aggregate.sql:

postgres=# EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.a, avg(t2.c)
FROM eager_agg_t1 t1
JOIN eager_agg_t2 t2
ON t1.b = t2.b
WHERE t2.c = 5
GROUP BY t1.a
ORDER BY t1.a;
QUERY PLAN
------------------------------------------------------------------
GroupAggregate
Output: t1.a, avg(t2.c)
Group Key: t1.a
-> Sort
Output: t1.a, t2.c
Sort Key: t1.a
-> Hash Join
Output: t1.a, t2.c
Hash Cond: (t1.b = t2.b)
-> Seq Scan on public.eager_agg_t1 t1
Output: t1.a, t1.b, t1.c
-> Hash
Output: t2.c, t2.b
-> Seq Scan on public.eager_agg_t2 t2
Output: t2.c, t2.b
Filter: (t2.c = '5'::double precision)
(16 rows)

Note that if I use ">" operator for example, this doesn't happen:
SELECT t1.a, avg(t2.c)
FROM eager_agg_t1 t1
JOIN eager_agg_t2 t2
ON t1.b = t2.b
WHERE t2.c > 5
GROUP BY t1.a
ORDER BY t1.a;
QUERY PLAN
------------------------------------------------------------------------
Finalize GroupAggregate
Output: t1.a, avg(t2.c)
Group Key: t1.a
-> Sort
Output: t1.a, (PARTIAL avg(t2.c))
Sort Key: t1.a
-> Hash Join
Output: t1.a, (PARTIAL avg(t2.c))
Hash Cond: (t1.b = t2.b)
-> Seq Scan on public.eager_agg_t1 t1
Output: t1.a, t1.b, t1.c
-> Hash
Output: t2.b, (PARTIAL avg(t2.c))
-> Partial HashAggregate
Output: t2.b, PARTIAL avg(t2.c)
Group Key: t2.b
-> Seq Scan on public.eager_agg_t2 t2
Output: t2.a, t2.b, t2.c
Filter: (t2.c > '5'::double precision)
(19 rows)

Is this behavior correct? If it's correct, would be possible to check
this limitation on setup_eager_aggregation() and maybe skip all the
other work?

--
Matheus Alcantara

In response to

Re: Eager aggregation, take 3 at 2025-08-09 01:32:21 from Richard Guo

Responses

Re: Eager aggregation, take 3 at 2025-08-15 01:41:18 from Richard Guo

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Geoghegan	2025-08-14 19:30:16	Re: index prefetching
Previous Message	Tom Lane	2025-08-14 19:18:51	Re: Identifying function-lookup failures due to argument name mismatches