Re: Batching in executor

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Junwang Zhao <zhjwpku(at)gmail(dot)com>
Cc: cca5507 <cca5507(at)qq(dot)com>, Daniil Davydov <3danissimo(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(at)vondra(dot)me>
Subject: Re: Batching in executor
Date: 2026-07-03 05:19:46
Message-ID: CA+HiwqHv1CggSbpn=n+GU=Cp1AqDD25=hzx_PTYyqjMyZO2+mA@mail.gmail.com
Lists: pgsql-hackers

Hi,

The last version on this thread (v7, the "Rebased" post) used the
RowBatch design: the AM handed the executor a RowBatch carrying a
slice of tuples, a single scan slot was re-pointed at the current
tuple through a repoint_slot AM callback, and an executor_batch_rows
GUC controlled the batch size. As I described in my pgconf.dev talk,
I have regrouped around a smaller, incremental foundation and dropped
that design. This series is the result; it supersedes v7 rather than
extending it.

What changed from the RowBatch design:

* RowBatch is gone. There is no batch container passed across the
AM/executor boundary, no RowBatchOps, and no am_payload indirection.
The batch lives in the scan slot itself.

* v7 already used a single re-pointed scan slot (the slot-array
design, with separate in/out arrays for the qual evaluator, was
dropped before that). What changes here is that the re-point is a
slot op (batch_next) rather than a separate repoint_slot AM callback,
so the executor drives iteration through the normal slot interface and
the AM exposes nothing beyond its scan slot.

* executor_batch_rows is gone. Batching is not opt-in or
size-tuned: the AM serves a natural batch (for heap, one page's
visible tuples) and the executor consumes it a tuple at a time. There
is no GUC and no per-query batch sizing.

* EXPLAIN (ANALYZE, BATCHES) is gone. Its counters reported the
effect of the executor_batch_rows size knob; with a batch fixed at one
page there is nothing batch-specific left to show, since a batch count
would just track pages scanned. The instrumentation that would be
worth having -- time and cardinality per batch as it crosses a plan
edge -- only has something to measure once batches propagate beyond
the scan node, so I would revisit it when batching reaches further
into the executor.

* The batch qual evaluator is also not part of this series. Batched
expression evaluation remains future work; quals here are evaluated
per tuple through the existing path.

The interface is two table-AM callbacks -- scan_getnextbatch and
batch_slot_callbacks -- plus a batch_next slot op. As the series
stands, a sequentially scanned AM must provide them: ExecInitSeqScan
takes the scan slot from table_slot_batch_callbacks() and SeqNext
drives table_scan_getnextbatch(), with no fallback to getnextslot, so
an AM lacking them cannot be seqscanned. That is deliberate -- it
keeps SeqNext to one path rather than a per-row capability branch --
but it does make these required of any heap-like AM, the way
scan_getnextslot is required today, and I would like opinions on
whether that is acceptable or whether a getnextslot fallback for AMs
that do not implement batching is worth the branch. (An out-of-tree
AM would need to add the two callbacks; both have straightforward
implementations on top of the existing page scan.)
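
A minimal, self-contained sketch of the control flow described above,
for readers who want to see the shape without opening the patches.
Everything here is a toy stand-in: the Toy* types, the toy_* functions
and their signatures are assumptions made for illustration, not the
structs or signatures in the series. It only shows the division of
labour: the AM loads one page's worth of rows into the slot, and the
scan node walks them through a batch_next slot op, with no
getnextslot fallback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; NOT PostgreSQL's real structs or signatures. */
#define TOY_ROWS_PER_PAGE 4

typedef struct ToySlot ToySlot;

typedef struct ToySlotOps
{
    /* Advance to the next row of the current batch; false when drained. */
    bool        (*batch_next) (ToySlot *slot);
} ToySlotOps;

struct ToySlot
{
    const ToySlotOps *ops;
    const int  *rows;           /* current batch: one "page" of rows */
    int         nrows;          /* rows in the current batch */
    int         pos;            /* index of current row, -1 before first */
    int         current;        /* value of the current row */
};

typedef struct ToyScan
{
    const int  *table;          /* the whole toy table */
    int         ntuples;
    int         next;           /* first tuple of the next batch */
} ToyScan;

/* Slot op: the batch lives in the slot, so stepping needs no AM call. */
static bool
toy_batch_next(ToySlot *slot)
{
    if (slot->pos + 1 >= slot->nrows)
        return false;
    slot->pos++;
    slot->current = slot->rows[slot->pos];
    return true;
}

static const ToySlotOps toy_batch_slot_ops = {toy_batch_next};

/* AM callback (hypothetical shape): load the next page into the slot. */
static bool
toy_scan_getnextbatch(ToyScan *scan, ToySlot *slot)
{
    int         remaining = scan->ntuples - scan->next;

    if (remaining <= 0)
        return false;
    slot->rows = scan->table + scan->next;
    slot->nrows = remaining < TOY_ROWS_PER_PAGE ? remaining : TOY_ROWS_PER_PAGE;
    slot->pos = -1;
    scan->next += slot->nrows;
    return true;
}

/* SeqNext-like driver: a single path, no per-row capability branch. */
static bool
toy_seq_next(ToyScan *scan, ToySlot *slot)
{
    for (;;)
    {
        if (slot->ops->batch_next(slot))
            return true;        /* slot now sits on the next row */
        if (!toy_scan_getnextbatch(scan, slot))
            return false;       /* end of scan */
    }
}

/* Example consumer: sum every row the scan returns. */
static long
toy_sum_all(const int *table, int ntuples)
{
    ToyScan     scan = {table, ntuples, 0};
    ToySlot     slot = {&toy_batch_slot_ops, NULL, 0, -1, 0};
    long        sum = 0;

    while (toy_seq_next(&scan, &slot))
        sum += slot.current;
    return sum;
}
```

In this toy the consumer still sees one row per call, which is the
point of the design: batching changes how rows reach the slot, not
what the node above it observes.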

The interface does not assume heap's representation: an AM that does
not produce per-tuple HeapTupleData (a columnar AM, say) is free to
choose how its batch holds data internally. What it must provide is
batch_next, which advances the slot to the current row and leaves it
deformable through the slot's ordinary deform routines (getsomeattrs
and friends); how the batch arrives at that row -- decoding a column
strip, materializing on demand -- is up to the AM. So the internal
layout is the AM's choice while the per-row face the executor sees is
fixed. The executor no longer allocates or manages receiving slots
and there is no row-oriented container an AM must fit into, which
addresses the AM-agnosticism concern from the earlier discussion.
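
To make the "internal layout is the AM's choice" point concrete, here
is a toy sketch of a column-oriented batch behind the same per-row
face. The ToyCol* types and toy_col_* functions are hypothetical names
invented for this illustration; real slot ops have different
signatures. The batch holds one array per attribute, batch_next only
moves a row index, and values are materialized lazily by a
getsomeattrs-style deform step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; NOT PostgreSQL's real structs or signatures. */
#define TOY_NATTS 2

typedef struct ToyColBatch
{
    const int  *cols[TOY_NATTS];    /* one array per attribute */
    int         nrows;
} ToyColBatch;

typedef struct ToyColSlot
{
    const ToyColBatch *batch;       /* AM-private batch representation */
    int         row;                /* current row, -1 before first */
    int         nvalid;             /* attributes deformed for this row */
    int         values[TOY_NATTS];  /* row-shaped view the executor reads */
} ToyColSlot;

/* Attach a batch to the slot and rewind to before the first row. */
static void
toy_col_slot_set_batch(ToyColSlot *slot, const ToyColBatch *batch)
{
    slot->batch = batch;
    slot->row = -1;
    slot->nvalid = 0;
}

/* batch_next analogue: advance the row index, deform nothing yet. */
static bool
toy_col_batch_next(ToyColSlot *slot)
{
    if (slot->batch == NULL || slot->row + 1 >= slot->batch->nrows)
        return false;
    slot->row++;
    slot->nvalid = 0;
    return true;
}

/* getsomeattrs analogue: materialize the first natts attributes. */
static void
toy_col_getsomeattrs(ToyColSlot *slot, int natts)
{
    assert(slot->row >= 0 && natts <= TOY_NATTS);
    for (int att = slot->nvalid; att < natts; att++)
        slot->values[att] = slot->batch->cols[att][slot->row];
    if (natts > slot->nvalid)
        slot->nvalid = natts;
}

/* Executor-side consumer: a qual on attribute 0 deforms only that one. */
static int
toy_count_where_gt(ToyColSlot *slot, int threshold)
{
    int         count = 0;

    while (toy_col_batch_next(slot))
    {
        toy_col_getsomeattrs(slot, 1);
        if (slot->values[0] > threshold)
            count++;
    }
    return count;
}
```

The consumer never touches ToyColBatch directly; it could be swapped
for a row-oriented batch without changing toy_count_where_gt, which is
the property the paragraph above describes.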

Patches are:

0001 - heapam: store full HeapTupleData in rs_vistuples[].
Stores the per-tuple headers that page_collect_tuples() already
builds, instead of rederiving them per tuple in heapgettup_pagemode().
A standalone improvement to the existing pagemode path, independent of
the rest of the series and can be considered on its own; it also gives the
batch path pre-built tuple headers to hand out. (This is the
rs_vistuples[] change from v7, essentially unchanged.)
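
A toy sketch of the idea behind 0001, under heavy simplification. The
ToyPage, ToyItemId, ToyTupleHeader and ToyScanDesc types and the
toy_* functions are invented for illustration and are much simpler
than the real page and HeapTupleData layouts. The point it tries to
show: the page-collection pass already computes each visible tuple's
data pointer and length, so storing that header once lets the
per-tuple fetch become an array lookup instead of a second walk
through the line pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; NOT PostgreSQL's real page or tuple layouts. */
#define TOY_MAX_TUPLES 8

typedef struct ToyItemId
{
    unsigned short off;         /* offset of tuple data within the page */
    unsigned short len;         /* tuple length */
    int         visible;        /* result of the visibility check */
} ToyItemId;

typedef struct ToyPage
{
    const char *data;
    int         nitems;
    ToyItemId   items[TOY_MAX_TUPLES];
} ToyPage;

typedef struct ToyTupleHeader
{
    const char *t_data;         /* points into the page */
    unsigned short t_len;
    unsigned short t_self;      /* line pointer index of this tuple */
} ToyTupleHeader;

typedef struct ToyScanDesc
{
    int         ntuples;
    /* Full headers are stored here, not just line pointer numbers. */
    ToyTupleHeader vistuples[TOY_MAX_TUPLES];
} ToyScanDesc;

/* One pass over the page: check visibility and build headers once. */
static void
toy_page_collect_tuples(const ToyPage *page, ToyScanDesc *scan)
{
    scan->ntuples = 0;
    for (int i = 0; i < page->nitems; i++)
    {
        const ToyItemId *item = &page->items[i];
        ToyTupleHeader *tup;

        if (!item->visible)
            continue;
        tup = &scan->vistuples[scan->ntuples++];
        tup->t_data = page->data + item->off;
        tup->t_len = item->len;
        tup->t_self = (unsigned short) i;
    }
}

/* Per-tuple fetch: a lookup, with nothing rederived from the page. */
static const ToyTupleHeader *
toy_fetch(const ToyScanDesc *scan, int idx)
{
    assert(idx >= 0 && idx < scan->ntuples);
    return &scan->vistuples[idx];
}
```

The same pre-built array is what a batch path can hand out row by row,
which is the second benefit mentioned above.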

0002 - tableam/slot interface for batched scans.
Adds scan_getnextbatch and batch_slot_callbacks to TableAmRoutine and
batch_next to TupleTableSlotOps, with their inline wrappers. Interface
only; no implementation, no caller.

0003 - heap implementation + sequential scan.
Implements the interface in heapam and uses it from the sequential
scan node. ExecInitSeqScan obtains the scan slot from
table_slot_batch_callbacks(); the existing ExecSeqScan variants drive
the batch slot unchanged. Forward and backward scans, including a
direction change within a batch, share one path, and the batch slot
deforms like a regular buffer-heap slot so EvalPlanQual and the rest
of the executor are unaffected.
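
One way to picture forward and backward scans sharing a single path,
including a direction change inside a batch, is a cursor that steps by
+1 or -1 over the batch. This is only a sketch of that idea: the
ToyDirSlot type, the ToyDirection enum and the toy_dir_* functions are
hypothetical, and the patch's actual handling (buffer pins, fetching
the neighbouring page when a batch is exhausted) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; NOT PostgreSQL's real structs or signatures. */
typedef enum ToyDirection
{
    TOY_BACKWARD = -1,
    TOY_FORWARD = 1
} ToyDirection;

typedef struct ToyDirSlot
{
    const int  *rows;           /* current batch */
    int         nrows;
    int         pos;            /* -1 = before first, nrows = after last */
} ToyDirSlot;

/* Load a batch; start just outside it on the side the scan enters from. */
static void
toy_dir_slot_load(ToyDirSlot *slot, const int *rows, int nrows,
                  ToyDirection dir)
{
    assert(nrows >= 0);
    slot->rows = rows;
    slot->nrows = nrows;
    slot->pos = (dir == TOY_FORWARD) ? -1 : nrows;
}

/*
 * Step one row in the given direction.  Returns false when the batch is
 * exhausted on that side, leaving the cursor just outside the batch so
 * that a later step in the opposite direction returns the edge row.
 */
static bool
toy_dir_batch_next(ToyDirSlot *slot, ToyDirection dir, int *row)
{
    int         next = slot->pos + (int) dir;

    if (next < 0)
    {
        slot->pos = -1;
        return false;
    }
    if (next >= slot->nrows)
    {
        slot->pos = slot->nrows;
        return false;
    }
    slot->pos = next;
    *row = slot->rows[next];
    return true;
}
```

Because direction is just the sign of the step, reversing mid-batch
needs no separate code path in this toy; a real implementation would
load the neighbouring page where this one returns false.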

Performance (meson release builds, master vs patched, pg_prewarm'd
table, vacuum-frozen for the all-visible rows; change in median ms,
patched relative to master, over the 1M..10M row sizes; negative is
faster; ranges across two runs):

                           all-visible     not-all-visible
count(*) (no qual)         -35% to -43%    -21% to -31%
count(*) WHERE pass-all    -17% to -23%    -14% to -16%
count(*) WHERE pass-none   -15% to -20%    -13% to -18%

The win is largest where per-tuple scan overhead dominates -- no qual,
and all-visible pages where the visibility check is cheap -- and
proportionally smaller as qual evaluation (unchanged by this series)
is added. Two runs agree to within a couple of points at 5M and 10M;
the 1-2M figures are noisier on my machine, so the larger sizes are
the ones to trust.

Open items:
- Only sequential scan uses the batch interface; the other scan
nodes keep their existing fetch paths. The heap-page-oriented ones
(sample, TID-range, bitmap heap) look convertible along the same
lines; index and index-only scans are less direct and would more
likely connect through the ongoing index-prefetching work. I left
these out to keep the first step small, not because the interface
cannot express them.
- Batched expression evaluation (a batch_next-driven qual opcode)
and any non-HeapTupleData / columnar batch consumption remain
follow-on work, as discussed at pgconf.dev and earlier on this thread.

Where this is going:

This series stops at the scan/TAM boundary. Profiling a selective
count(*) ... WHERE shows why that is the right first cut: batching
removes the per-tuple scan-fetch overhead (heapgettup_pagemode and
friends), which is where the win comes from, and what remains is
per-tuple deform and per-tuple expression evaluation, each about a
quarter of the cycles, with the predicate operator itself a couple of
percent. Batching only the scan does not touch those, and a throwaway
patch I wrote that batched the qual loop moved almost nothing, so the
remaining cost is in the per-tuple executor work, not the loop around
it.

Some of that is improvable in the scalar path with no batching or
columnar representation at all (a denser per-attribute slot layout,
and avoiding the per-tuple indirect deform call where the slot type is
fixed); those help the row-at-a-time executor generally and overlap
with the seqscan inefficiencies Andres has catalogued, and I am pursuing
them separately. Beyond that, letting expression evaluation or a
parent node consume a batch as columns rather than a tuple at a time
is the larger direction, but it turns on how batch column data should
be represented, which I would not want to settle yet. What this
series tries to get right for all of it is that the batch lives in the
slot and batch_next is the row-compatible way to walk it, so later
work can reach the batch without a new cross-node container and
anything not converted keeps working unchanged.

--
Thanks, Amit Langote

Attachment                                                        Content-Type         Size
v8-0003-Implement-batched-sequential-scans-for-heap.patch         application/x-patch  23.1 KB
v8-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch   application/x-patch  16.5 KB
v8-0002-Add-table-AM-and-slot-interface-for-batched-scans.patch   application/x-patch  5.4 KB
