| From: | Amit Langote <amitlangote09(at)gmail(dot)com> |
|---|---|
| To: | Junwang Zhao <zhjwpku(at)gmail(dot)com> |
| Cc: | cca5507 <cca5507(at)qq(dot)com>, Daniil Davydov <3danissimo(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(at)vondra(dot)me> |
| Subject: | Re: Batching in executor |
| Date: | 2026-07-03 05:19:46 |
| Message-ID: | CA+HiwqHv1CggSbpn=n+GU=Cp1AqDD25=hzx_PTYyqjMyZO2+mA@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
The last version on this thread (v7, the "Rebased" post) used the
RowBatch design: the AM handed the executor a RowBatch carrying a
slice of tuples, a single scan slot was re-pointed at the current
tuple through a repoint_slot AM callback, and an executor_batch_rows
GUC controlled the batch size. As I described in my pgconf.dev talk,
I have regrouped around a smaller, incremental foundation and dropped
that design. This series is the result; it supersedes v7 rather than
extending it.
What changed from the RowBatch design:
* RowBatch is gone. There is no batch container passed across the
AM/executor boundary, no RowBatchOps, and no am_payload indirection.
The batch lives in the scan slot itself.
* v7 already used a single re-pointed scan slot (the slot-array
design, with separate in/out arrays for the qual evaluator, was
dropped before that). What changes here is that the re-point is a
slot op (batch_next) rather than a separate repoint_slot AM callback,
so the executor drives iteration through the normal slot interface and
the AM exposes nothing beyond its scan slot.
* executor_batch_rows is gone. Batching is not opt-in or
size-tuned: the AM serves a natural batch (for heap, one page's
visible tuples) and the executor consumes it a tuple at a time. There
is no GUC and no per-query batch sizing.
* EXPLAIN (ANALYZE, BATCHES) is gone. Its counters reported the
effect of the executor_batch_rows size knob; with a batch fixed at one
page there is nothing batch-specific left to show, since a batch count
would just track pages scanned. The instrumentation that would be
worth having -- time and cardinality per batch as it crosses a plan
edge -- only has something to measure once batches propagate beyond
the scan node, so I would revisit it when batching reaches further
into the executor.
* The batch qual evaluator is also not part of this series. Batched
expression evaluation remains future work; quals here are evaluated
per tuple through the existing path.
The interface is two table-AM callbacks -- scan_getnextbatch and
batch_slot_callbacks -- plus a batch_next slot op. As the series
stands, an AM must provide them to be sequentially scanned: ExecInitSeqScan
takes the scan slot from table_slot_batch_callbacks() and SeqNext
drives table_scan_getnextbatch(), with no fallback to getnextslot, so
an AM lacking them cannot be seqscanned. That is deliberate -- it
keeps SeqNext to one path rather than a per-row capability branch --
but it does make these required of any heap-like AM, the way
scan_getnextslot is required today, and I would like opinions on
whether that is acceptable or whether a getnextslot fallback for AMs
that do not implement batching is worth the branch. (An out-of-tree
AM would need to add the two callbacks; both have straightforward
implementations on top of the existing page scan.)
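A minimal sketch of the single-path shape described above, using mock types only. Every name prefixed mock_/Mock is hypothetical and the signatures are assumptions for illustration, not those in the patches: the driver first asks the slot for the next row of its current batch and, only when the batch is exhausted, asks the AM for the next batch. There is no getnextslot branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not PostgreSQL's TupleTableSlot or TableScanDesc. */
typedef struct MockSlot MockSlot;

typedef struct MockSlotOps
{
    /* Advance to the next row of the current batch; false when exhausted. */
    bool        (*batch_next) (MockSlot *slot);
} MockSlotOps;

struct MockSlot
{
    const MockSlotOps *ops;
    const int  *rows;       /* rows of the current batch (one "page") */
    int         nrows;
    int         pos;        /* index of the current row, -1 before the first */
    int         value;      /* the "current tuple" */
};

typedef struct MockScan
{
    const int  *table;
    int         ntuples;
    int         page_size;  /* rows per page, i.e. per batch */
    int         next_page;
} MockScan;

static bool
mock_batch_next(MockSlot *slot)
{
    if (slot->pos + 1 >= slot->nrows)
        return false;
    slot->pos++;
    slot->value = slot->rows[slot->pos];
    return true;
}

static const MockSlotOps mock_batch_ops = {mock_batch_next};

/* Assumed AM callback: point the slot at the next page's rows. */
static bool
mock_scan_getnextbatch(MockScan *scan, MockSlot *slot)
{
    int         start = scan->next_page * scan->page_size;
    int         left = scan->ntuples - start;

    if (left <= 0)
        return false;
    slot->rows = scan->table + start;
    slot->nrows = left < scan->page_size ? left : scan->page_size;
    slot->pos = -1;
    scan->next_page++;
    return true;
}

/* SeqNext-like driver: one path, no per-row capability check. */
static bool
mock_seq_next(MockScan *scan, MockSlot *slot)
{
    for (;;)
    {
        if (slot->ops->batch_next(slot))
            return true;
        if (!mock_scan_getnextbatch(scan, slot))
            return false;
    }
}

/* count(*) WHERE value > threshold, with the qual evaluated per tuple. */
static int
mock_count_matching(const int *table, int ntuples, int page_size, int threshold)
{
    MockScan    scan = {table, ntuples, page_size, 0};
    MockSlot    slot = {&mock_batch_ops, NULL, 0, -1, 0};
    int         count = 0;

    assert(page_size > 0);
    while (mock_seq_next(&scan, &slot))
        if (slot.value > threshold)
            count++;
    return count;
}
```

A getnextslot fallback would add a second arm to mock_seq_next, chosen per scan or per row; the sketch shows what the loop looks like without it.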
The interface does not assume heap's representation: an AM that does
not produce per-tuple HeapTupleData (a columnar AM, say) is free to
choose how its batch holds data internally. What it must provide is
batch_next, which advances the slot to the current row and leaves it
deformable through the slot's ordinary deform routines (getsomeattrs
and friends); how the batch arrives at that row -- decoding a column
strip, materializing on demand -- is up to the AM. So the internal
layout is the AM's choice while the per-row face the executor sees is
fixed. The executor no longer allocates or manages receiving slots
and there is no row-oriented container an AM must fit into, which
addresses the AM-agnosticism concern from the earlier discussion.
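To illustrate the split between internal layout and per-row face, here is a sketch of a toy columnar slot, assuming a fixed two-column integer table. ColBatchSlot, col_batch_next and col_getsomeattrs are hypothetical names, and the real slot ops have different signatures. The batch is held as column strips; batch_next only moves the row position, and values are materialized on demand by the deform routine.

```c
#include <assert.h>
#include <stdbool.h>

#define COL_NATTS 2

/* Hypothetical columnar batch slot; not a PostgreSQL structure. */
typedef struct ColBatchSlot
{
    /* AM-private layout: one array ("column strip") per attribute */
    const int  *cols[COL_NATTS];
    int         nrows;
    int         pos;                /* current row, -1 before the first */

    /* per-row face seen by the executor */
    int         values[COL_NATTS];
    int         nvalid;             /* leading attributes deformed so far */
} ColBatchSlot;

/* Advance to the next row; nothing is decoded until it is asked for. */
static bool
col_batch_next(ColBatchSlot *slot)
{
    if (slot->pos + 1 >= slot->nrows)
        return false;
    slot->pos++;
    slot->nvalid = 0;
    return true;
}

/* getsomeattrs-like deform: make the first natts attributes available. */
static void
col_getsomeattrs(ColBatchSlot *slot, int natts)
{
    assert(slot->pos >= 0 && slot->pos < slot->nrows);
    assert(natts >= 0 && natts <= COL_NATTS);

    for (int i = slot->nvalid; i < natts; i++)
        slot->values[i] = slot->cols[i][slot->pos];
    if (natts > slot->nvalid)
        slot->nvalid = natts;
}

/* Row-at-a-time consumer: sums one attribute over the whole batch. */
static long
col_sum_attr(ColBatchSlot *slot, int attno)
{
    long        sum = 0;

    while (col_batch_next(slot))
    {
        col_getsomeattrs(slot, attno + 1);
        sum += slot->values[attno];
    }
    return sum;
}
```

The consumer in col_sum_attr never sees the column arrays; it only calls the two row-level routines, which is the property the paragraph above relies on.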
Patches are:
0001 - heapam: store full HeapTupleData in rs_vistuples[].
Stores the per-tuple headers that page_collect_tuples() already
builds, instead of rederiving them per tuple in heapgettup_pagemode().
A standalone improvement to the existing pagemode path, independent of
the rest of the series and worth considering on its own; it also gives the
batch path pre-built tuple headers to hand out. (This is the
rs_vistuples[] change from v7, essentially unchanged.)
0002 - tableam/slot interface for batched scans.
Adds scan_getnextbatch and batch_slot_callbacks to TableAmRoutine and
batch_next to TupleTableSlotOps, with their inline wrappers. Interface
only; no implementation, no caller.
0003 - heap implementation + sequential scan.
Implements the interface in heapam and uses it from the sequential
scan node. ExecInitSeqScan obtains the scan slot from
table_slot_batch_callbacks(); the existing ExecSeqScan variants drive
the batch slot unchanged. Forward and backward scans, including a
direction change within a batch, share one path, and the batch slot
deforms like a regular buffer-heap slot so EvalPlanQual and the rest
of the executor are unaffected.
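One way forward and backward stepping can share a path inside a batch is sketched below. MockBatchCursor and mock_batch_step are hypothetical, and the sketch covers only movement within one batch; how the patch crosses page boundaries on a direction change is not shown here. Running off either end parks the position just outside the batch, so a reversal re-enters on the row last returned from that side.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical direction and cursor types, for illustration only. */
typedef enum MockDirection
{
    DIR_BACKWARD = -1,
    DIR_FORWARD = 1
} MockDirection;

typedef struct MockBatchCursor
{
    int         nrows;      /* rows in the current batch */
    int         pos;        /* -1 = before first, nrows = after last */
} MockBatchCursor;

/*
 * Step one row in the given direction.  Returns false when the batch is
 * exhausted in that direction, leaving pos parked just outside the batch.
 */
static bool
mock_batch_step(MockBatchCursor *cur, MockDirection dir)
{
    int         next = cur->pos + (int) dir;

    if (next < 0)
    {
        cur->pos = -1;
        return false;
    }
    if (next >= cur->nrows)
    {
        cur->pos = cur->nrows;
        return false;
    }
    cur->pos = next;
    return true;
}
```

Because the same function handles both directions, a caller that flips direction mid-batch needs no special case.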
Performance (meson release builds, master vs patched, pg_prewarm'd
table, vacuum-frozen for the all-visible rows; median ms over the
1M..10M row sizes, ranges across two runs):
                           all-visible     not-all-visible
count(*) (no qual)         -35% to -43%    -21% to -31%
count(*) WHERE pass-all    -17% to -23%    -14% to -16%
count(*) WHERE pass-none   -15% to -20%    -13% to -18%
The win is largest where per-tuple scan overhead dominates -- no qual,
and all-visible pages where the visibility check is cheap -- and
proportionally smaller as qual evaluation (unchanged by this series)
is added. The two runs agree to within a couple of percentage points at 5M and 10M;
the 1-2M figures are noisier on my machine, so the larger sizes are
the ones to trust.
Open items:
- Only sequential scan uses the batch interface; the other scan
nodes keep their existing fetch paths. The heap-page-oriented ones
(sample, TID-range, bitmap heap) look convertible along the same
lines; index and index-only scans are less direct and would more
likely connect through the ongoing index-prefetching work. I left
these out to keep the first step small, not because the interface
cannot express them.
- Batched expression evaluation (a batch_next-driven qual opcode)
and any non-HeapTupleData / columnar batch consumption remain
follow-on work, as discussed at pgconf.dev and earlier on this thread.
Where this is going:
This series stops at the scan/TAM boundary. Profiling a selective
count(*) ... WHERE shows why that is the right first cut: batching
removes the per-tuple scan-fetch overhead (heapgettup_pagemode and
friends), which is where the win comes from, and what remains is
per-tuple deform and per-tuple expression evaluation, each about a
quarter of the cycles, with the predicate operator itself a couple of
percent. Batching only the scan does not touch those, and a throwaway
patch I wrote that batched the qual loop moved almost nothing, so the
remaining cost is in the per-tuple executor work, not the loop around
it.
Some of that is improvable in the scalar path with no batching or
columnar representation at all (a denser per-attribute slot layout,
and avoiding the per-tuple indirect deform call where the slot type is
fixed); those help the row-at-a-time executor generally and overlap
the seqscan inefficiencies Andres has catalogued, and I am pursuing
them separately. Beyond that, letting expression evaluation or a
parent node consume a batch as columns rather than a tuple at a time
is the larger direction, but it turns on how batch column data should
be represented, which I would not want to settle yet. What this
series tries to get right for all of it is that the batch lives in the
slot and batch_next is the row-compatible way to walk it, so later
work can reach the batch without a new cross-node container and
anything not converted keeps working unchanged.
--
Thanks, Amit Langote
| Attachment | Content-Type | Size |
|---|---|---|
| v8-0003-Implement-batched-sequential-scans-for-heap.patch | application/x-patch | 23.1 KB |
| v8-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch | application/x-patch | 16.5 KB |
| v8-0002-Add-table-AM-and-slot-interface-for-batched-scans.patch | application/x-patch | 5.4 KB |