Batching in executor

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Batching in executor
Date: 2025-09-26 13:28:33
Message-ID: CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
Lists: pgsql-hackers

At PGConf.dev this year we had an unconference session [1] on whether
the community can support an additional batch executor. The discussion
there led me to start hacking on $subject. I have also had off-list
discussions on this topic in recent months with Andres and David, who
have offered useful thoughts.

This patch series is an early attempt to make executor nodes pass
around batches of tuples instead of tuple-at-a-time slots. The main
motivation is to enable expression evaluation in batch form, which can
substantially reduce per-tuple overhead (mainly from function calls)
and open the door to further optimizations such as SIMD usage in
aggregate transition functions. We could even change the algorithms of
some plan nodes to operate on batches when, for example, a child node
can return them.

The expression evaluation changes are still exploratory, but before
they can be made ready for serious review, we first need a way for
scan nodes to produce tuples in batches and an executor API that
allows upper nodes to consume them. The patch set is therefore
structured in two parts: the first three patches lay that groundwork
in the executor and table AM, and the later patches prototype
batch-aware expression evaluation on top of it.

Patches 0001-0003 introduce a new batch table AM API and an initial
heapam implementation that can return multiple tuples per call.
SeqScan is adapted to use this interface, with new ExecSeqScanBatch*
routines that fetch tuples in bulk but can still return one
TupleTableSlot at a time to preserve compatibility. On the executor
side, ExecProcNodeBatch() is added alongside ExecProcNode(), with
TupleBatch as the new container for passing groups of tuples. ExecScan
has batch-aware variants that use the AM API internally, but can fall
back to row-at-a-time behavior when required. Plan shapes and EXPLAIN
output remain unchanged; the differences here are executor-internal.
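
To make the calling convention concrete, the shape I have in mind is
roughly as follows; the struct layout and the consumption loop below
are only illustrative and differ in detail from what 0001-0003
actually do:

/*
 * Illustrative sketch only; field names and layout differ from the actual
 * patches.  A scan node fills a fixed-capacity container and the parent
 * consumes it with a single ExecProcNodeBatch() call instead of up to
 * EXEC_BATCH_ROWS separate ExecProcNode() calls.
 */
typedef struct TupleBatch
{
    int             nrows;                      /* number of valid rows */
    TupleTableSlot *rows[EXEC_BATCH_ROWS];      /* one slot per row */
} TupleBatch;

static void
consume_child_batches(PlanState *parent)
{
    for (;;)
    {
        TupleBatch *batch = ExecProcNodeBatch(outerPlanState(parent));

        if (batch == NULL || batch->nrows == 0)
            break;                              /* child is exhausted */

        for (int i = 0; i < batch->nrows; i++)
        {
            /* per-row (or, better, vectorized per-batch) work here */
        }
    }
}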

At present, heapam batches are restricted to tuples from a single
page, which means they may not always fill EXEC_BATCH_ROWS (currently
64). That limits how much upper executor nodes can leverage batching,
especially with selective quals where batches may end up sparsely
populated. A future improvement would be to allow batches to span
pages or to let the scan node request more tuples when its buffer is
not yet full, so that it avoids passing mostly empty TupleBatches to
upper nodes.
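
Schematically, the refill idea would look something like this
(hand-wavy; the helper named here is hypothetical and does not exist
in the patches):

/*
 * Hand-wavy sketch; heap_refill_batch() is hypothetical and does not exist
 * in the patches, where a batch currently always ends at the page boundary.
 */
while (batch->nrows < EXEC_BATCH_ROWS && !scan_finished)
{
    /* advance to the next page and keep appending visible tuples */
    if (!heap_refill_batch(scan, batch))
        scan_finished = true;                   /* heap exhausted */
}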

It might also be worth adding some lightweight instrumentation to make
it easier to reason about batch behavior. For example, counters for
average rows per batch, reasons why a batch ended (capacity reached,
page boundary, end of scan), or batches per million rows could help
confirm whether limitations like the single-page restriction or
EXEC_BATCH_ROWS size are showing up in benchmarks. Suggestions from
others on which forms of instrumentation would be most useful are
welcome.
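
Concretely, the kind of thing I am imagining would look like this
(purely a sketch; nothing like it exists in the patches yet):

/* Purely a sketch; none of these counters exist in the patches yet. */
typedef struct BatchScanInstrumentation
{
    uint64      nbatches;           /* batches produced */
    uint64      ntuples;            /* tuples across all batches */
    uint64      ended_full;         /* batch ended: EXEC_BATCH_ROWS reached */
    uint64      ended_pagebound;    /* batch ended: heap page exhausted */
    uint64      ended_eos;          /* batch ended: end of scan */
} BatchScanInstrumentation;

/*
 * Average rows per batch is then ntuples / nbatches; a low average combined
 * with a high ended_pagebound count would point at the single-page
 * restriction rather than at EXEC_BATCH_ROWS being too small.
 */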

Patches 0004 onwards start experimenting with making expression
evaluation batch-aware, first in the aggregate node. These patches add
new EEOPs (ExprEvalOps and ExprEvalSteps) to fetch attributes into
TupleBatch vectors, evaluate quals across a batch, and run aggregate
transitions over multiple rows at once. Agg is extended to pull
TupleBatch from its child via ExecProcNodeBatch(), with two prototype
paths: one that loops inside the interpreter and another that calls
the transition function once per batch using AggBulkArgs. These are
still PoCs, but with scan nodes and the executor capable of moving
batches around, they provide a base from which the work can be refined
into something potentially committable after the usual polish,
testing, and review.
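
In caricature, the ROWLOOP and DIRECT paths differ as follows
(illustrative fragments only; the real EEOP implementations also
handle nulls, per-group state lookup, and memory contexts):

/*
 * (a) ROWLOOP: loop inside the interpreter, one fmgr call per row, but
 * without per-row executor dispatch.
 */
for (int i = 0; i < batch->nrows; i++)
{
    fcinfo->args[0].value = pergroup->transValue;
    fcinfo->args[1].value = arg_values[i];      /* column vector from batch */
    pergroup->transValue = FunctionCallInvoke(fcinfo);
}

/*
 * (b) DIRECT: a single fmgr call per batch; the transition function gets the
 * whole vector (AggBulkArgs in the patch) and loops internally, which is
 * where SIMD-friendly transition implementations become possible.
 */
pergroup->transValue = FunctionCallInvoke(fcinfo_bulk);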

One area that needs more thought is how TupleBatch interacts with
ExprContext. At present the patches extend ExprContext with
scan_batch, inner_batch, and outer_batch fields, but per-batch
evaluation still spills into ecxt_per_tuple_memory, effectively
reusing the per-tuple context for per-batch work. That’s arguably an
abuse of the contract described in ExecEvalExprSwitchContext(), and it
will need a cleaner definition of how batch-scoped memory should be
managed. Feedback on how best to structure that would be particularly
helpful.
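
One possible direction would be to give ExprContext a batch-scoped
context with its own reset point, analogous to
ExecEvalExprSwitchContext(); sketched below with hypothetical names
(neither ecxt_per_batch_memory nor ExecEvalBatchExpr exists today):

/*
 * Hypothetical contract: a separate per-batch memory context, reset once per
 * batch, keeping per-tuple and per-batch allocation lifetimes distinct.
 * The field and function names are made up for illustration.
 */
static inline Datum
ExecEvalBatchExprSwitchContext(ExprState *state, ExprContext *econtext,
                               TupleBatch *batch, bool *isNull)
{
    Datum           retDatum;
    MemoryContext   oldContext;

    MemoryContextReset(econtext->ecxt_per_batch_memory);
    oldContext = MemoryContextSwitchTo(econtext->ecxt_per_batch_memory);
    retDatum = ExecEvalBatchExpr(state, econtext, batch, isNull);
    MemoryContextSwitchTo(oldContext);

    return retDatum;
}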

To evaluate the overheads and benefits, I ran microbenchmarks with
single and multi-aggregate queries on a single table, with and without
WHERE clauses. Tables were fully VACUUMed so visibility maps are set
and IO costs are minimal. shared_buffers was large enough to fit the
whole table (up to 10M rows, ~43 rows per page), and all pages were
prewarmed into cache before tests. Table schema/script is at [2].

Observations from benchmarking (detailed benchmark tables are at [3];
below is a high-level summary of the main patterns):

* Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
sum(a) FROM bar_N): batching scan output alone improved latency by
~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
especially once fmgr overhead was paid per batch instead of per row.

* Single aggregate, with WHERE (WHERE a > 0 AND a < N): evaluating
quals across a batch in the interpreter gave a big step up, with
latencies dropping by ~30-40% compared to executor_batching=off.

* Five aggregates, no WHERE: batching input from the child scan cut
~15% off runtime. Adding batched transition evaluation increased
improvements to ~30%.

* Five aggregates, with WHERE: modest gains from scan/input batching,
but per-batch transition evaluation and batched quals brought ~20-30%
improvement.

* Across all cases, executor overheads became visible only after IO
was minimized. Once executor cost dominated, batching consistently
reduced CPU time, with the largest benefits coming from avoiding
per-row fmgr calls and evaluating quals across batches.

I would appreciate it if others could try these patches with their own
microbenchmarks or workloads and see whether they can reproduce numbers
similar to mine. Feedback on both the general direction and the
details of the patches would be very helpful. In particular, patches
0001-0003, which add the basic batch APIs and integrate them into
SeqScan, are intended to be the first candidates for review and
eventual commit. Comments on the later, more experimental patches
(aggregate input batching, and batch-aware evaluation of quals and
aggregate transitions) are also welcome.

--
Thanks, Amit Langote

[1] https://wiki.postgresql.org/wiki/PGConf.dev_2025_Developer_Unconference#Can_the_Community_Support_an_Additional_Batch_Executor

[2] Tables (create_tables.sh):

for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
  psql -c "drop table if exists bar_$i; create table bar_$i (a int, b int, c int, d int, e int, f int, g int, h int, i text, j int, k int, l int, m int, n int, o int);" > /dev/null 2>&1
  psql -c "insert into bar_$i select i, i, i, i, i, i, i, i, repeat('x', 100), i, i, i, i, i, i from generate_series(1, $i) i;" > /dev/null 2>&1
  echo "bar_$i created."
done

[3] Benchmark result tables

All timings are in milliseconds. off = executor_batching off, on =
executor_batching on. Negative %diff means on is better than off.

Single aggregate, no WHERE
(~20% faster with scan batching only; ~40%+ faster with batched transitions)

With only batched-seqscan (0001-0003):
Rows off on %diff
1M 10.448 8.147 -22.0
2M 18.442 14.552 -21.1
3M 25.296 22.195 -12.3
4M 36.285 33.383 -8.0
5M 44.441 39.894 -10.2
10M 93.110 82.744 -11.1

With batched-agg on top (0001-0007):
Rows off on %diff
1M 9.891 5.579 -43.6
2M 17.648 9.653 -45.3
3M 27.451 13.919 -49.3
4M 36.394 24.269 -33.3
5M 44.665 29.260 -34.5
10M 87.898 56.221 -36.0

Single aggregate, with WHERE
(~30–40% faster once quals + transitions are batched)

With only batched-seqscan (0001-0003):
Rows off on %diff
1M 18.485 17.749 -4.0
2M 34.696 33.033 -4.8
3M 49.582 46.155 -6.9
4M 70.270 67.036 -4.6
5M 84.616 81.013 -4.3
10M 174.649 164.611 -5.7

With batched-agg and batched-qual on top (0001-0008):
Rows off on %diff
1M 18.887 12.367 -34.5
2M 35.706 22.457 -37.1
3M 51.626 30.902 -40.1
4M 72.694 48.214 -33.7
5M 88.103 57.623 -34.6
10M 181.350 124.278 -31.5

Five aggregates, no WHERE
(~15% faster with scan/input batching; ~30% with batched transitions)

Agg input batching only (0001-0004):
Rows off on %diff
1M 23.193 19.196 -17.2
2M 42.177 35.862 -15.0
3M 62.192 51.121 -17.8
4M 83.215 74.665 -10.3
5M 99.426 91.904 -7.6
10M 213.794 184.263 -13.8

Batched transition eval, per-row fmgr (0001-0006):
Rows off on %diff
1M 23.501 19.672 -16.3
2M 44.128 36.603 -17.0
3M 64.466 53.079 -17.7
5M 103.442 97.623 -5.6
10M 219.120 190.354 -13.1

Batched transition eval, per-batch fmgr (0001-0007):
Rows off on %diff
1M 24.238 16.806 -30.7
2M 43.056 30.939 -28.1
3M 62.938 43.295 -31.2
4M 83.346 63.357 -24.0
5M 100.772 78.351 -22.2
10M 213.755 162.203 -24.1

Five aggregates, with WHERE
(~10–15% faster with scan/input batching; ~30% with batched transitions + quals)

Agg input batching only (0001-0004):
Rows off on %diff
1M 24.261 22.744 -6.3
2M 45.802 41.712 -8.9
3M 79.311 72.732 -8.3
4M 107.189 93.870 -12.4
5M 129.172 115.300 -10.7
10M 278.785 236.275 -15.2

Batched transition eval, per-batch fmgr (0001-0007):
Rows off on %diff
1M 24.354 19.409 -20.3
2M 46.888 36.687 -21.8
3M 82.147 57.683 -29.8
4M 109.616 76.471 -30.2
5M 133.777 94.776 -29.2
10M 282.514 194.954 -31.0

Batched transition eval + batched qual (0001-0008):
Rows off on %diff
1M 24.691 20.193 -18.2
2M 47.182 36.530 -22.6
3M 82.030 58.663 -28.5
4M 110.573 76.500 -30.8
5M 136.701 93.299 -31.7
10M 280.551 191.021 -31.9

Attachment Content-Type Size
v1-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch application/octet-stream 11.2 KB
v1-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch application/octet-stream 6.3 KB
v1-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch application/octet-stream 21.5 KB
v1-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch application/octet-stream 16.9 KB
v1-0002-SeqScan-add-batch-driven-variants-returning-slots.patch application/octet-stream 27.2 KB
v1-0001-Add-batch-table-AM-API-and-heapam-implementation.patch application/octet-stream 13.7 KB
v1-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch application/octet-stream 9.0 KB
