Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2026-03-28 00:35:24
Message-ID: CAH2-Wz=kMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G=zrw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Mar 22, 2026 at 9:14 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> V17 is attached. This addresses most of your feedback, but defers
> dealing with the trickier parts for now.

Attached is v18, which addresses the tricky issues I skipped in v17.

Highlights:

* New commit (the first) that adds a heapam_indexscan.c file, and
moves all of the relevant existing heapam_handler.c functions over to
this new file. This commit is strictly mechanical, and has no
functional impact.

* Another new commit (the second) adds the new slot-based index scan
interface, and pushes the VM lookups that currently take place in
nodeIndexonlyScan.c into heapam. This includes the instrumentation
changes that were previously in their own small patch. It also
includes the VISITED_PAGES_LIMIT stuff -- there's no longer any commit
that breaks VISITED_PAGES_LIMIT, nor is there a later commit that
subsequently restores the functionality.

* We now extract one key piece of infrastructure that other table AMs
can now reuse to manage their own index scan batches: a new inline
function called tableam_util_fetch_next_batch now appears as an
indexbatch.h inline function.

This function (which appeared under a different name in v17) manages
deciding whether to call amgetbatch (performing the amgetbatch call
when needed) or to simply return the next batch in line (because such
a batch already exists/is already loaded in the ring buffer from
earlier). It also detects and handles cross-batch changes in scan
direction, and sets/reads the fields that track when a batch is
definitely the last one in the given scan direction.

I tried to go further by also placing the bulk of the code in
heapam_index_getnext_scanbatch_pos in indexbatch.h, but ultimately
concluded that it wasn't worth it. There isn't much code in
heapam_index_getnext_scanbatch_pos to generalize. Some of that code
coordinates quite directly with the heapam read stream callback.

* Added an "isGuarded" field to IndexScanBatchData. That way we can
assert that any batch loaded within the read stream callback is
already unguarded/has no batch index page buffer pin still held. The
field isn't just for assertions, though. We also rely on it in one or
two places during release builds, which seemed clearer to me. For
example, we no longer require that amunguardbatch routines be
idempotent, which seems like a minor improvement.

* Other work is broken out into its own commit, to better highlight
issues that aren't strictly related to the amgetbatch changes.
Specifically, a new commit adds the "don't unpin on rescan" behavior.
These changes, which first appeared in the main amgetbatch commit, aim
to minimize regressions with cached workloads (this was discussed
briefly upthread, plus Andres and I discussed it over IM some weeks
back).

* Replaced what was previously an assert with an equivalent
pg_assume(), within heapam_index_getnext_scanbatch_pos -- which is a
very hot function. We now "pg_assume(batchringbuf->headBatch ==
scanPos->batch)", allowing the compiler to exploit this existing
invariant. This seems to help the compiler generate better code,
presumably by allowing more efficient register allocation. I haven't
looked at the disassembly just yet, but it helps with certain
microbenchmarks enough to matter.

* The "Add UnlockBufferGetLSN utility function" commit now uses an
atomic operation (in the common case where it actually drops the index
page's pin), just like UnlockReleaseBuffer itself following Andres'
commit f39cb8c011062d65e146c1e9d1aae221e96d8320 from today.

I completed this item quickly, after pulling from master a little
while ago. The expectation for UnlockBufferGetLSN is that it be at
least as good as a BufferGetLSNAtomic followed by a
UnlockReleaseBuffer, so IMV it made sense to include this last minute
addition in v18.

* The "don't wait" patches are no longer included, since Andres
committed them to the master branch just now. Thanks to both you
(Andres) and Melanie for taking care of this very thorny issue!

I believe I have now acted on all of Andres' feedback, with one minor
exception: Andres disliked how we use 'if (scan->xs_heapfetch)' to
determine whether to use the scan's batch cache (we don't want to use
it at all if the scan is ending). I just ran out of steam, so this
version doesn't address the problem. But I'm still tracking it as an
open item (I don't disagree with Andres, but this problem is slightly
awkward). Note that I *did* create a helper function here, so we at
least avoid having 2 copies of the loop that looks for a free cache
batch slot.

--
Peter Geoghegan

Attachment Content-Type Size
v18-0017-Dirty-hack-to-make-read_stream_reset-end-not-wai.patch application/octet-stream 4.4 KB
v18-0016-WIP-instrumentation-Account-for-resource-usage-u.patch application/octet-stream 1.4 KB
v18-0019-WIP-aio-bufmgr-Fix-race-condition-leading-to-dea.patch application/octet-stream 3.1 KB
v18-0018-aio-Fix-pgaio_io_wait-for-staged-IOs-B.patch application/octet-stream 6.3 KB
v18-0015-WIP-aio-io_uring-Use-IO-size-not-IO-queue-to-tri.patch application/octet-stream 4.1 KB
v18-0013-WIP-read_stream-Only-increase-distance-when-wait.patch application/octet-stream 2.3 KB
v18-0012-bufmgr-Return-whether-WaitReadBuffers-needed-to-.patch application/octet-stream 2.8 KB
v18-0011-WIP-aio-io_uring-Allow-IO-methods-to-check-if-IO.patch application/octet-stream 4.6 KB
v18-0010-WIP-read_stream-Issue-IO-synchronously-while-in-.patch application/octet-stream 2.0 KB
v18-0014-WIP-read_stream-Prevent-distance-from-decaying-t.patch application/octet-stream 2.9 KB
v18-0008-heapam-Add-index-scan-I-O-prefetching.patch application/octet-stream 44.2 KB
v18-0009-Make-hash-index-AM-use-amgetbatch-interface.patch application/octet-stream 47.3 KB
v18-0007-Add-UnlockBufferGetLSN-utility-function.patch application/octet-stream 4.8 KB
v18-0006-heapam-Optimize-pin-transfers-during-index-scans.patch application/octet-stream 6.6 KB
v18-0005-Add-interfaces-that-enable-index-prefetching.patch application/octet-stream 234.0 KB
v18-0002-Add-slot-based-table-AM-interface-for-index-scan.patch application/octet-stream 56.4 KB
v18-0004-heapam-Keep-buffer-pins-across-index-scan-rescan.patch application/octet-stream 3.4 KB
v18-0003-heapam-Track-heap-block-in-IndexFetchHeapData.patch application/octet-stream 4.6 KB
v18-0001-Move-heapam_handler.c-index-scan-code-to-new-fil.patch application/octet-stream 9.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2026-03-28 00:37:07 Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions
Previous Message Tomas Vondra 2026-03-28 00:14:41 Re: Shared hash table allocations