| From: | Junwang Zhao <zhjwpku(at)gmail(dot)com> |
|---|---|
| To: | Amit Langote <amitlangote09(at)gmail(dot)com> |
| Cc: | Daniil Davydov <3danissimo(at)gmail(dot)com>, cca5507 <2624345507(at)qq(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(at)vondra(dot)me> |
| Subject: | Re: Batching in executor |
| Date: | 2026-02-01 14:49:31 |
| Message-ID: | CAEG8a3JKNRGwBb-C_YV6Px4aFHwd_G=AsB5AbfyU4-8YZT54AQ@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi Amit,
On Thu, Jan 29, 2026 at 3:35 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
>
> Hi,
>
> Here is v5 of the patch series.
>
> Patches 0001-0003 add the core batching infrastructure. 0001 adds the
> batch table AM API with heapam implementation, 0002 wires up SeqScan
> to use it (still returning one slot at a time), and 0003 adds EXPLAIN
> (BATCHES). I'd love to hear people's thoughts around TupleBatch
> structure added in 0002. I thought about making it a separate patch so
> that 0002 will still populate the single ScanState.ss_scanTupleSlot,
> but that means we'd still have to call the TAM callback to populate
> the tuple in the TAM's batch struct into the slot, defeating the whole
> point. With TupleBatch, you have executor_batch_rows number of slots
> which are filled in one TAM callback (materialize_all) call. So I
> decided to keep the TupleBatch and related things in 0002.
>
> For scans without quals, batching shows 20-30% improvement with no
> visible regressions when batching is disabled (batch_rows=0):
>
> SELECT * FROM t LIMIT n (no qual)
>
> Rows Master batch=0 %diff batch=64 %diff
> ------ -------- ------- ----- -------- -----
> 1M 12.42 ms 11.96 ms 3.7% 8.56 ms 31.0%
> 3M 38.95 ms 38.92 ms 0.1% 28.59 ms 26.6%
> 10M 153.64 ms 150.28 ms 2.2% 112.95 ms 26.5%
>
> (%diff: positive = faster than master, negative = slower)
>
> Patches 0004-0005 add batched qual evaluation and are more
> experimental (see below on why 0005 exists). For quals referencing
> early columns, the improvement is significant:
>
> SELECT * FROM t WHERE a = 0 ... OFFSET n (qual on 1st col)
>
> Rows Master batch=64 %diff
> ------ -------- -------- -----
> 1M 30.19 ms 15.55 ms 48.5%
> 3M 92.47 ms 50.01 ms 45.9%
> 10M 325.58 ms 211.83 ms 34.9%
>
> However, for quals on later columns (e.g., 15th), batching provides no
> benefit - deformation dominates and batching doesn't help:
>
> SELECT * FROM t WHERE o = 0 ... OFFSET n (qual on 15th col)
>
> Rows Master batch=64 %diff
> ------ -------- -------- -----
> 1M 44.14 ms 44.56 ms -0.9%
> 3M 133.89 ms 137.77 ms -2.9%
> 10M 503.33 ms 528.88 ms -5.1%
>
> I don't have a satisfactory explanation for why batching doesn't help
> the deform-heavy case at all. One would expect at least some benefit
> from reduced per-tuple overhead, but that's not materializing.
>
> I've also been struggling to understand why 0004 affects the per-tuple
> path even when batch_rows=0. For quals with 0% selectivity (all rows
> fail the qual), perf shows ExecInterpExpr is noticeably hotter with
> the patched code compared to master, even though batching is disabled:
>
> SELECT * FROM t WHERE a = 0 ... OFFSET n (0% selectivity)
>
> Rows Master batch=0 %diff batch=64 %diff
> ------ -------- ------- ----- -------- -----
> 1M 24.37 ms 28.67 ms -17.6% 12.46 ms 48.9%
> 3M 73.95 ms 85.07 ms -15.0% 41.64 ms 43.7%
> 10M 287.63 ms 316.81 ms -10.1% 188.01 ms 34.6%
>
> Compare that to 100% selectivity (all rows pass), where there's no regression:
>
> SELECT * FROM t WHERE a > 0 ... OFFSET n (100% selectivity)
>
> Rows Master batch=0 %diff batch=64 %diff
> ------ -------- ------- ----- -------- -----
> 1M 29.44 ms 29.10 ms 1.2% 16.61 ms 43.6%
> 3M 91.22 ms 90.28 ms 1.0% 54.10 ms 40.7%
> 10M 360.77 ms 331.25 ms 8.2% 224.00 ms 37.9%
>
> I tried moving batch opcodes to a separate interpreter (0005) thinking
> it might be register pressure or jump table effects from adding cases
> to ExecInterpExpr's switch. With 0005, the generated assembly for
> ExecInterpExpr looks identical to master (same stack frame size, same
> epilogue), yet the performance still differs. Specifically, the ldp
> instruction in the function epilogue shows 53% hotness in patched vs
> 35% in master. We still need placeholder entries in the dispatch
> table, so it's unclear if this fully isolates the per-tuple path. I'll
> continue looking at perf, but I feel at a bit of a loss here and
> would appreciate any insights.
>
> Other changes worth noting:
>
> - I removed the BatchVector intermediate representation that copied
> Datums into columnar arrays before qual evaluation (it used to be in
> the batched qual patch 0004). Now quals access batch slots' tts_values
> directly. This simplifies the code and the copy overhead wasn't paying
> off. If we pursue serious vectorization later, this may need to be
> revisited, but removing it doesn't degrade performance.
>
> --
> Thanks, Amit Langote
Here are some comments for v5:
0001:
+/*
+ * heap_scan_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+/*
+ * heap_scan_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
The comment headers don't match the function names (heap_scan_begin_batch
vs. heap_begin_batch, and heap_scan_end_batch vs. heap_end_batch).
0002:
+/*
+ * heap_scan_materialize_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
Same naming mismatch here (heap_scan_materialize_all in the comment vs.
heap_materialize_batch_all).
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+ if (relation->rd_tableam)
+ return relation->rd_tableam->batch_callbacks(relation);
+ elog(ERROR, "relation does not support TupleBatch operations");
+}
Is there any chance batch_callbacks can be NULL here? If so, this would
segfault before reaching the elog. I felt that changing the check to
if (relation->rd_tableam && relation->rd_tableam->batch_callbacks)
would be more robust, but then I found that table_slot_callbacks follows
the same pattern, so this shouldn't be a problem.
0003:
+++ b/src/include/executor/execBatch.h
@@ -13,6 +13,8 @@
#ifndef EXECBATCH_H
#define EXECBATCH_H
+#include <limits.h>
I guess this header is included because of the use of INT_MAX; if
execBatch.c is the only user, maybe move that include into execBatch.c
instead of the public header?
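For illustration only (a hypothetical fragment, not the actual patch),
the layout I'm suggesting would look like:

```c
/* execBatch.h: no <limits.h> here if the header itself never uses INT_MAX */

/* execBatch.c */
#include <limits.h>

/* ... INT_MAX used only in the .c file, e.g. as a "no limit" sentinel ... */
```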
--
Regards
Junwang Zhao