| From: | Amit Langote <amitlangote09(at)gmail(dot)com> |
|---|---|
| To: | Daniil Davydov <3danissimo(at)gmail(dot)com> |
| Cc: | cca5507 <2624345507(at)qq(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(at)vondra(dot)me> |
| Subject: | Re: Batching in executor |
| Date: | 2026-01-29 07:35:13 |
| Message-ID: | CA+HiwqH-2GmTKLW9kHdnqV4KdFiPfuAdVK2TgqOM2JaaeUYXnw@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
Here is v5 of the patch series.
Patches 0001-0003 add the core batching infrastructure. 0001 adds the
batch table AM API with a heapam implementation, 0002 wires up SeqScan
to use it (still returning one slot at a time), and 0003 adds EXPLAIN
(BATCHES). I'd love to hear people's thoughts on the TupleBatch
structure added in 0002. I considered splitting it out into a separate
patch, with 0002 continuing to populate the single
ScanState.ss_scanTupleSlot, but that would still mean calling a TAM
callback to copy each tuple out of the TAM's batch struct into that
one slot, defeating the whole point. With TupleBatch, there are
executor_batch_rows slots, all filled by a single TAM callback
(materialize_all). So I decided to keep TupleBatch and the related
changes in 0002.
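To make that concrete, here is roughly the shape of things (a
simplified sketch: the cursor field and the refill call's name are
illustrative, not necessarily the patch's spelling):

/* Simplified sketch of the TupleBatch added in 0002. */
typedef struct TupleBatch
{
    int              nrows;    /* tuples currently in the batch */
    int              maxrows;  /* capacity, i.e. executor_batch_rows */
    int              cur;      /* next row to hand out (illustrative) */
    TupleTableSlot **slots;    /* one slot per row, all filled by a
                                * single materialize_all callback call
                                * into the TAM */
} TupleBatch;

/* 0002's SeqScan still returns one slot per ExecProcNode call,
 * just served out of the batch: */
if (batch->cur >= batch->nrows)
    table_scan_getnextbatch(scan, batch);   /* refill; name illustrative */
return batch->slots[batch->cur++];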
For scans without quals, batching shows a 20-30% improvement, with no
visible regression when batching is disabled (batch_rows=0):
SELECT * FROM t LIMIT n (no qual)

Rows   Master     batch=0    %diff   batch=64   %diff
-----  ---------  ---------  ------  ---------  ------
1M     12.42 ms   11.96 ms    3.7%   8.56 ms    31.0%
3M     38.95 ms   38.92 ms    0.1%   28.59 ms   26.6%
10M    153.64 ms  150.28 ms   2.2%   112.95 ms  26.5%

(%diff: positive = faster than master, negative = slower)
Patches 0004-0005 add batched qual evaluation and are more
experimental (see below for why 0005 exists). For quals referencing
early columns, the improvement is significant:
SELECT * FROM t WHERE a = 0 ... OFFSET n (qual on 1st col)

Rows   Master     batch=64   %diff
-----  ---------  ---------  ------
1M     30.19 ms   15.55 ms   48.5%
3M     92.47 ms   50.01 ms   45.9%
10M    325.58 ms  211.83 ms  34.9%
However, for quals on later columns (e.g., the 15th), batching
provides no benefit; tuple deformation dominates the cost:
SELECT * FROM t WHERE o = 0 ... OFFSET n (qual on 15th col)

Rows   Master     batch=64   %diff
-----  ---------  ---------  ------
1M     44.14 ms   44.56 ms   -0.9%
3M     133.89 ms  137.77 ms  -2.9%
10M    503.33 ms  528.88 ms  -5.1%
I don't have a satisfactory explanation for why batching doesn't help
the deform-heavy case at all. One would expect at least some benefit
from reduced per-tuple overhead, but that's not materializing.
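For reference, my mental model of the batched qual path is the sketch
below (the loop shape and the qual_matches helper are illustrative,
not the literal 0004 code); it shows why the deform cost is unchanged:

for (int i = 0; i < batch->nrows; i++)
{
    TupleTableSlot *slot = batch->slots[i];

    /*
     * Deformation is still per-row and per-attribute: for a qual on
     * the 15th column this walks attributes 1..15 of every tuple,
     * exactly as the unbatched path does.
     */
    slot_getsomeattrs(slot, 15);

    /* Batching only amortizes the per-tuple call overhead here. */
    batch->matched[i] = qual_matches(slot);  /* helper is illustrative */
}

If deforming 15 attributes is, say, 80% of the per-row cost, then even
eliminating all of the remaining call overhead caps the win at ~20%,
which is roughly consistent with the numbers above, though it still
doesn't explain a small regression rather than a small win.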
I've also been struggling to understand why 0004 affects the per-tuple
path even when batch_rows=0. For quals with 0% selectivity (all rows
fail the qual), perf shows ExecInterpExpr is noticeably hotter with
the patched code compared to master, even though batching is disabled:
SELECT * FROM t WHERE a = 0 ... OFFSET n (0% selectivity)

Rows   Master     batch=0    %diff    batch=64   %diff
-----  ---------  ---------  -------  ---------  ------
1M     24.37 ms   28.67 ms   -17.6%   12.46 ms   48.9%
3M     73.95 ms   85.07 ms   -15.0%   41.64 ms   43.7%
10M    287.63 ms  316.81 ms  -10.1%   188.01 ms  34.6%
Compare that to 100% selectivity (all rows pass), where there's no regression:
SELECT * FROM t WHERE a > 0 ... OFFSET n (100% selectivity)

Rows   Master     batch=0    %diff   batch=64   %diff
-----  ---------  ---------  ------  ---------  ------
1M     29.44 ms   29.10 ms   1.2%    16.61 ms   43.6%
3M     91.22 ms   90.28 ms   1.0%    54.10 ms   40.7%
10M    360.77 ms  331.25 ms  8.2%    224.00 ms  37.9%
I tried moving the batch opcodes to a separate interpreter (0005),
thinking the regression might be register pressure or jump-table
effects from adding cases to ExecInterpExpr's switch. With 0005, the
generated assembly for ExecInterpExpr looks identical to master's
(same stack frame size, same epilogue), yet the performance still
differs: the ldp instruction in the function epilogue shows 53%
hotness in the patched build vs 35% on master. We still need
placeholder entries in the dispatch table, so it's unclear whether
this fully isolates the per-tuple path. I'll keep looking at perf,
but I'm at a bit of a loss here and would appreciate any insights.
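For anyone reasoning about 0005 without reading it: with computed-goto
dispatch, every opcode value needs a jump-table entry even if its
handler lives in the other interpreter. Heavily simplified (the real
execExprInterp.c resolves opcodes to label addresses at
expression-ready time, and the batch opcode name here is made up):

static const void *const dispatch_table[] = {
    &&CASE_EEOP_DONE,
    &&CASE_EEOP_SCAN_FETCHSOME,
    /* ... existing opcodes ... */
    &&CASE_EEOP_QUAL_BATCH,  /* placeholder: handled by the separate
                              * interpreter in 0005, but an entry must
                              * remain so the EEOP_* values stay dense */
};

#define EEO_NEXT() goto *dispatch_table[(++op)->opcode]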
Other changes worth noting:
- I removed the BatchVector intermediate representation that copied
Datums into columnar arrays before qual evaluation (it used to be part
of the batched qual patch 0004). Quals now access the batch slots'
tts_values directly (see the sketch below). This simplifies the code,
and the copying overhead wasn't paying for itself. If we pursue
serious vectorization later this may need to be revisited, but
removing it doesn't degrade performance.
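In code terms, the change is roughly this (a sketch, not the literal
patch code):

/* Before: 0004 first copied Datums out into columnar arrays... */
for (int i = 0; i < batch->nrows; i++)
    vec->values[i] = batch->slots[i]->tts_values[attnum - 1];
/* ...and the qual step then ran over vec->values[]. */

/* Now: the qual step indexes the batch slots directly, no copy. */
for (int i = 0; i < batch->nrows; i++)
{
    Datum value = batch->slots[i]->tts_values[attnum - 1];
    /* ...evaluate the qual on value (and tts_isnull[attnum - 1])... */
}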
--
Thanks, Amit Langote
| Attachment | Content-Type | Size |
|---|---|---|
| v5-0001-Add-batch-table-AM-API-and-heapam-implementation.patch | application/octet-stream | 13.0 KB |
| v5-0002-SeqScan-add-batch-driven-variants-returning-slots.patch | application/octet-stream | 27.6 KB |
| v5-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch | application/octet-stream | 14.0 KB |
| v5-0004-WIP-Add-ExecQualBatch-for-batched-qual-evaluation.patch | application/octet-stream | 32.2 KB |
| v5-0005-WIP-Use-dedicated-interpreter-for-batched-qual-ev.patch | application/octet-stream | 5.9 KB |