| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Peter Geoghegan <pg(at)bowt(dot)ie> |
| Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Subject: | Re: index prefetching |
| Date: | 2026-03-21 00:33:02 |
| Message-ID: | t6mtqbv2mbfhjni4bvwdgoecppjmxvbyfwl6utovzv76xc2672@k3o5ryevaeqv |
| Lists: | pgsql-hackers |
Hi,
On 2026-03-19 13:44:16 -0400, Peter Geoghegan wrote:
> The other notable change aims to improve performance for cached
> workloads through more aggressive specialization of callback routines
> like heapam_index_getnext_slot. There are now a total of 4 versions:
>
> 1. One for index-only scans + amgetbatch.
> 2. One for index-only scans + amgettuple.
> 3. One for plain index scans + amgetbatch.
> 4. One for plain index scans + amgettuple.
>
> I've also aggressively used pg_attribute_always_inline (without this
> specialization, performance ends up worse rather than better).
>
> There might be some concerns about the distributed costs of increased
> code size.
I'm not particularly concerned about that - I think it actually might often
*increase* the icache hit ratio, because without this specialization every
scan has to jump over a fair bit of code not used during that scan.
> From 8bcd4eb22a6479302fecaf5088c1f03f43f1f407 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Tue, 10 Mar 2026 14:22:25 -0400
> Subject: [PATCH v16 01/17] Make IndexScanInstrumentation a pointer in executor
> scan nodes.
>
> Change the IndexScanInstrumentation fields in IndexScanState,
> IndexOnlyScanState, and BitmapIndexScanState from inline structs to
> pointers. This avoids adding additional overhead as new fields are
> added to IndexScanInstrumentation, at least in the common case where the
> instrumentation isn't used (i.e. when the executor node isn't being run
> through an EXPLAIN ANALYZE).
LGTM
> From 997822891f393a13be22c1be2d2d0094a485fddf Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Tue, 10 Mar 2026 14:40:35 -0400
> Subject: [PATCH v16 02/17] heapam: Track heap block in IndexFetchHeapData
> using xs_blk
>
> Add an explicit BlockNumber field (xs_blk) to IndexFetchHeapData that
> tracks which heap block is currently pinned in xs_cbuf.
>
> heapam_index_fetch_tuple now uses xs_blk to determine when buffer
> switching is needed, replacing the previous approach that compared
> buffer identities via ReleaseAndReadBuffer on every non-HOT-chain call.
> The new approach skips the buffer-switching path entirely when the next
> TID is on the same heap page, which also means that heap_page_prune_opt
> is called exactly once per block (when the block is first pinned).
I think that was already almost always the case, due to the
    if (prev_buf != hscan->xs_cbuf)
check.
> This is preparatory work for an upcoming commit that will need xs_blk
> to manage buffer pin transfers between the scan and the executor slot.
A subsequent commit adds an earlier ExecClearTuple(slot) to make the buffer
refcount decrement cheaper (due to hitting the one-element cache in
bufmgr.c). I wonder if it's worth pulling that into this commit? Mostly to
make that larger commit smaller.
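To illustrate why the earlier ExecClearTuple() helps, here's a self-contained sketch of a one-element lookup cache in front of a refcount table, in the spirit of bufmgr.c's private refcount tracking (names and structure are illustrative only, not bufmgr.c's actual ones):

```c
#include <assert.h>

#define NBUFFERS 16

static int	refcounts[NBUFFERS];
static int	cached_buf = -1;	/* one-element cache: last buffer looked up */
static int	slow_lookups;		/* how often the fast path missed */

static int *
get_refcount_entry(int buf)
{
	if (buf == cached_buf)
		return &refcounts[buf];	/* fast path: one-element cache hit */

	/* slow path: the real code searches an array/hash table here */
	slow_lookups++;
	cached_buf = buf;
	return &refcounts[buf];
}

static void
pin_buffer(int buf)
{
	(*get_refcount_entry(buf))++;
}

static void
unpin_buffer(int buf)
{
	(*get_refcount_entry(buf))--;
}
```

Dropping the pin before the scan touches the next buffer means the decrement still hits the fast path; that's the effect of moving the ExecClearTuple() earlier.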
> From c83aef317ba452b7631feee10062d92f33a64930 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Tue, 9 Sep 2025 19:50:03 -0400
> Subject: [PATCH v16 03/17] Add interfaces that enable index prefetching.
>
> -/* mark current scan position */
> -typedef void (*ammarkpos_function) (IndexScanDesc scan);
> -
> -/* restore marked scan position */
> -typedef void (*amrestrpos_function) (IndexScanDesc scan);
> +/* invalidate index AM state that independently tracks scan's position */
> +typedef void (*amposreset_function) (IndexScanDesc scan,
> + IndexScanBatch batch);
>
> /*
> * Callback function signatures - for parallel index scans.
> @@ -309,10 +320,12 @@ typedef struct IndexAmRoutine
> ambeginscan_function ambeginscan;
> amrescan_function amrescan;
> amgettuple_function amgettuple; /* can be NULL */
> + amgetbatch_function amgetbatch; /* can be NULL */
> + amkillitemsbatch_function amkillitemsbatch; /* can be NULL */
> + amunguardbatch_function amunguardbatch; /* can be NULL */
> amgetbitmap_function amgetbitmap; /* can be NULL */
> amendscan_function amendscan;
> - ammarkpos_function ammarkpos; /* can be NULL */
> - amrestrpos_function amrestrpos; /* can be NULL */
> + amposreset_function amposreset; /* can be NULL */
>
> /* interface functions to support parallel index scans */
> amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
The commit message doesn't mention that this affects ammarkpos/amrestrpos.
Does that imply that index AMs that want to support mark/restore need to
switch away from amgettuple? Probably fine, but worth mentioning, I think.
> diff --git a/src/include/access/genam.h b/src/include/access/genam.h
> index 1a27bf060..6b5c48e5e 100644
> --- a/src/include/access/genam.h
> +++ b/src/include/access/genam.h
> @@ -96,6 +96,7 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
>
> +/*
> + * amgetbatch utilities called by indexam.c (in indexbatch.c)
> + */
> +extern void index_batchscan_init(IndexScanDesc scan);
> +extern void index_batchscan_reset(IndexScanDesc scan);
> +extern void index_batchscan_end(IndexScanDesc scan);
> +extern void index_batchscan_mark_pos(IndexScanDesc scan);
> +extern void index_batchscan_restore_pos(IndexScanDesc scan);
> +
> +/*
> + * amgetbatch utilities called by table AMs (in indexbatch.c)
> + */
> +extern void tableam_util_batch_dirchange(IndexScanDesc scan);
> +extern void tableam_util_kill_scanpositem(IndexScanDesc scan);
> +extern void tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch);
> +extern void tableam_util_unguard_batch(IndexScanDesc scan, IndexScanBatch batch);
> +
> +/*
> + * amgetbatch utilities called by index AMs (in indexbatch.c)
> + */
> +extern void indexam_util_batch_unlock(IndexScanDesc scan, IndexScanBatch batch,
> + Buffer buf);
> +extern IndexScanBatch indexam_util_batch_alloc(IndexScanDesc scan);
> +extern void indexam_util_batch_release(IndexScanDesc scan, IndexScanBatch batch);
> +
> #endif /* GENAM_H */
I wonder if these should be in a separate header; a reasonably large number
of files include genam.h due to systable_beginscan() etc., and those should
not need to know about this stuff.
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index 9bcf74862..55c9069d6 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -129,10 +129,37 @@ typedef struct IndexFetchHeapData
> Buffer xs_cbuf;
> BlockNumber xs_blk;
>
> - /* Current heap block's corresponding page in the visibility map */
> - Buffer xs_vmbuffer;
> + /* For index-only scans that must access the visibility map */
> + Buffer xs_vmbuffer; /* visibility map buffer */
> + int xs_vm_items; /* # items to resolve visibility info for */
> + bool xs_lastinblock; /* last TID on this block in current batch? */
> +
> } IndexFetchHeapData;
FWIW, Melanie is going to use xs_vmbuffer for on-access pruning soon; maybe
worth not phrasing it as specific to IOSs, as it will need to be revised
anyway.
> + typedef struct HeapBatchData
> + {
> + uint8 *visInfo; /* per-item visibility flags, or NULL */
> + } HeapBatchData;
> +
> + /*
> + * Per-item visibility flags stored in HeapBatchData.visInfo array
> + */
> + #define HEAP_BATCH_VIS_CHECKED 0x01 /* checked item in VM? */
> + #define HEAP_BATCH_VIS_ALL_VISIBLE 0x02 /* block is known all-visible? */
So we only use 2 out of 8 bits. I guess it's probably not worth doing bit
shift stuff. But still curious if you tried?
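For reference, the packed variant I'm wondering about would look something like this (hypothetical names; 2 bits per item, 4 items per byte):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical packed layout: 2 visibility bits per item, 4 items per byte.
 * Flag values match the unpacked variant's semantics.
 */
#define VIS_CHECKED			0x01	/* checked item in VM? */
#define VIS_ALL_VISIBLE		0x02	/* block is known all-visible? */

static inline void
visinfo_set(uint8_t *visInfo, int item, uint8_t flags)
{
	int			shift = (item & 3) * 2;

	visInfo[item >> 2] |= (uint8_t) (flags << shift);
}

static inline uint8_t
visinfo_get(const uint8_t *visInfo, int item)
{
	int			shift = (item & 3) * 2;

	return (visInfo[item >> 2] >> shift) & 0x03;
}
```

That shaves the array to a quarter of the size, at the cost of a shift and mask on every access - which may well be a loss in this hot path.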
> diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
> index da7503c57..74af9efc3 100644
> --- a/src/include/access/nbtree.h
> +++ b/src/include/access/nbtree.h
> @@ -924,111 +924,27 @@ typedef struct BTVacuumPostingData
>
> + * Access the btree-private per-batch data from an IndexScanBatch pointer.
> + * This follows the standard convention for index AM opaque state: it can be
> + * found at a fixed negative offset from the IndexScanBatch pointer.
> */
> +static inline BTBatchData *
> +BTBatchGetData(IndexScanBatch batch)
> {
> + return (BTBatchData *) ((char *) batch - MAXALIGN(sizeof(BTBatchData)));
> +}
I think it may be worth threading this through a generic helper macro taking
IndexScanDesc, IndexScanBatch, and the type of the private batch data, which
does the pointer math and can also assert that
    iscan->batch_index_opaque_size == MAXALIGN(sizeof(BTBatchData))
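Roughly what I have in mind, as a standalone sketch (MAXALIGN and the structs are simplified stand-ins here; in core this would presumably use AssertMacro, since the check sits in an expression):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* stand-ins for MAXALIGN and the real scan/batch structs */
#define MAXALIGN(LEN)	(((size_t) (LEN) + 7) & ~((size_t) 7))

typedef struct IndexScanBatchData
{
	int			firstItem;
} IndexScanBatchData;
typedef IndexScanBatchData *IndexScanBatch;

typedef struct IndexScanDescData
{
	size_t		batch_index_opaque_size;
} IndexScanDescData;

typedef struct BTBatchData
{
	int			moreLeft;
} BTBatchData;

/*
 * Generic accessor: do the negative-offset pointer math and verify that the
 * scan's recorded opaque size matches the requested type.
 */
#define IndexBatchGetOpaque(scan, batch, type) \
	(assert((scan)->batch_index_opaque_size == MAXALIGN(sizeof(type))), \
	 (type *) ((char *) (batch) - MAXALIGN(sizeof(type))))

static int
accessor_demo(void)
{
	IndexScanDescData scan = {MAXALIGN(sizeof(BTBatchData))};
	char	   *mem = malloc(MAXALIGN(sizeof(BTBatchData)) +
							 sizeof(IndexScanBatchData));
	IndexScanBatch batch = (IndexScanBatch)
		(mem + MAXALIGN(sizeof(BTBatchData)));
	BTBatchData *opaque = IndexBatchGetOpaque(&scan, batch, BTBatchData);
	int			ok = ((char *) opaque == mem);

	free(mem);
	return ok;
}
```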
> +/*
> + * Data about one batch of items returned by (and passed to) amgetbatch during
> + * index scans.
> + *
> + * Each batch allocation has the following memory layout:
> + *
> + * [table AM opaque area] <- at -(batch_table_offset) from batch ptr
> + * [index AM opaque area] <- at -(batch_index_opaque_size) from batch ptr
> + * [IndexScanBatchData] <- the returned pointer
> + * [items[maxitemsbatch]]
> + * [table AM trailing data] <- e.g. per-item visibility flags
> + * [currTuples workspace] <- sized by index AM (batch_tuples_workspace)
> + *
> + * The AM-specific opaque areas are accessed via accessor functions defined by
> + * each table AM and index AM that supports the batch interfaces.
> + */
> +typedef struct IndexScanBatchData
> +{
> + XLogRecPtr lsn; /* index page's LSN */
Still doubtful this should be in generic infra (together with
indexam_util_batch_unlock()).
> +/*
> + * Maximum number of batches (leaf pages) we can keep in memory. We need a
> + * minimum of two, since we'll only consider releasing one batch when another
> + * is read.
> + *
> + * The current maximum of 64 batches is somewhat of an arbitrary limit. Very
> + * few scans ever get near to this limit in practice.
> + */
> +#define INDEX_SCAN_MAX_BATCHES 64
> +#define INDEX_SCAN_CACHE_BATCHES 2
Just curious: How did you end up with INDEX_SCAN_CACHE_BATCHES == 2?
> +/*
> + * State used by table AMs to manage an index scan that uses the amgetbatch
> + * interface. Scans use a ring buffer of batches returned by amgetbatch.
> + *
> + * Batches are kept in the order that they were returned in by amgetbatch,
> + * since that is the same order that table_index_getnext_slot will return
> + * matches in. However, table AMs are free to fetch table tuples in whatever
> + * order is most convenient/efficient -- provided that such reordering cannot
> + * affect the order that table_index_getnext_slot later returns tuples in.
> + */
> +typedef struct BatchRingBuffer
> +{
> + /* current positions in batches[] for scan */
> + BatchRingItemPos scanPos; /* scan's read position */
> + BatchRingItemPos markPos; /* mark/restore position */
> +
> + IndexScanBatch markBatch;
> +
> + /*
> + * headBatch is an index to the earliest still-valid batch in 'batches'.
> + * In practice this must be the scan's current scanPos batch (scanBatch).
> + */
> + uint8 headBatch;
> +
> + /*
> + * nextBatch is an index to the next empty batch slot in 'batches'. This
> + * is only actually usable when the scan is !index_scan_batch_full().
> + */
> + uint8 nextBatch;
Is nextBatch basically the tail?
The "only actually usable" bit is a bit confusing, given that
index_scan_batch_full() is implemented by
(nextBatch - headBatch) == INDEX_SCAN_MAX_BATCHES
Maybe worth having a static assert somewhere to ensure that
INDEX_SCAN_MAX_BATCHES isn't set to a value bigger than what fits into a
uint8?
> +/*
> + * Append given batch to scan's batch ring buffer.
> + */
> +static inline void
> +index_scan_batch_append(IndexScanDescData *scan, IndexScanBatch batch)
> +{
> + BatchRingBuffer *ringbuf = &scan->batchringbuf;
> + uint8 nextBatch = ringbuf->nextBatch;
> +
> + ringbuf->batches[nextBatch & (INDEX_SCAN_MAX_BATCHES - 1)] = batch;
> + ringbuf->nextBatch++;
> +}
Perhaps worth an assert ensuring that there's space in the ring buffer?
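Putting the last two points together, a standalone sketch of the uint8 ring arithmetic plus the suggested assertions (names simplified from the patch's):

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_SCAN_MAX_BATCHES 64

/*
 * With uint8 counters, a full ring of 256 would be indistinguishable from an
 * empty one, so the largest safe power of two is 128.
 */
_Static_assert(INDEX_SCAN_MAX_BATCHES <= 128 &&
			   (INDEX_SCAN_MAX_BATCHES & (INDEX_SCAN_MAX_BATCHES - 1)) == 0,
			   "INDEX_SCAN_MAX_BATCHES must be a power of two <= 128");

typedef struct RingBuf
{
	uint8_t		headBatch;		/* earliest still-valid batch */
	uint8_t		nextBatch;		/* tail: next empty slot */
	int			batches[INDEX_SCAN_MAX_BATCHES];
} RingBuf;

static int
ring_count(const RingBuf *rb)
{
	/* uint8 wraparound makes this correct even when nextBatch overflowed */
	return (uint8_t) (rb->nextBatch - rb->headBatch);
}

static void
ring_append(RingBuf *rb, int batch)
{
	/* the suggested assertion: never append to a full ring */
	assert(ring_count(rb) < INDEX_SCAN_MAX_BATCHES);
	rb->batches[rb->nextBatch & (INDEX_SCAN_MAX_BATCHES - 1)] = batch;
	rb->nextBatch++;
}
```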
> diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
> index 8847d7f94..b5b8f509a 100644
> --- a/src/include/executor/instrument_node.h
> +++ b/src/include/executor/instrument_node.h
> @@ -48,6 +48,12 @@ typedef struct IndexScanInstrumentation
> {
> /* Index search count (incremented with pgstat_count_index_scan call) */
> uint64 nsearches;
> +
> + /*
> + * heap blocks fetched counts (incremented by index_getnext_slot calls
> + * within table AMs, though only during index-only scans)
> + */
> + uint64 nheapfetches;
> } IndexScanInstrumentation;
s/heap/table/ for anything new imo.
> +/*
> + * heapam_batch_resolve_visibility
> + * Obtain visibility information for a TID from caller's batch.
I really dislike these pointless restatements of the function name and the
weird indentation of the function "title", which is indented differently
across files and within a file. Since they're not in use in heapam_handler.c,
can we please not introduce them now?
> + *
> + * Called during index-only scans. We always check the visibility of caller's
> + * item (an offset into caller's batch->items[] array). We might also set
> + * visibility info for other items from caller's batch more proactively when
> + * that makes sense.
> + *
> + * We keep two competing considerations in balance when determining whether to
> + * check additional items: the need to keep the cost of visibility map access
> + * under control when most items will never be returned by the scan anyway
> + * (important for inner index scans of anti-joins and semi-joins), and the
> + * need to not hold onto index leaf pages for too long.
> + *
> + * Note on Memory Ordering Effects
> + * -------------------------------
> + *
> + * visibilitymap_get_status does not lock the visibility map buffer, and
> + * therefore the result we read here could be slightly stale. However, it
> + * can't be stale enough to matter.
> + *
> + * We need to detect clearing a VM bit due to an insert right away, because
> + * the tuple is present in the index page but not visible. The reading of the
> + * TID by this scan (using a shared lock on the index buffer) is serialized
> + * with the insert of the TID into the index (using an exclusive lock on the
> + * index buffer). Because the VM bit is cleared before updating the index,
> + * and locking/unlocking of the index page acts as a full memory barrier, we
> + * are sure to see the cleared bit if we see a recently-inserted TID.
> + *
> + * Deletes do not update the index page (only VACUUM will clear out the TID),
> + * so the clearing of the VM bit by a delete is not serialized with this test
> + * below, and we may see a value that is significantly stale. However, we
> + * don't care about the delete right away, because the tuple is still visible
> + * until the deleting transaction commits or the statement ends (if it's our
> + * transaction). In either case, the lock on the VM buffer will have been
> + * released (acting as a write barrier) after clearing the bit. And for us to
> + * have a snapshot that includes the deleting transaction (making the tuple
> + * invisible), we must have acquired ProcArrayLock after that time, acting as
> + * a read barrier.
> + *
> + * It's worth going through this complexity to avoid needing to lock the VM
> + * buffer, which could cause significant contention.
> + */
ISTM that moving this comment here hides it too much; I'm pretty sure there
are other AMs using the visibility map out there. Perhaps we should just
move it to the top of visibilitymap.c?
> +static void
> +heapam_batch_resolve_visibility(IndexScanDesc scan, ScanDirection direction,
> + IndexScanBatch batch, HeapBatchData *hbatch,
> + BatchRingItemPos *pos)
> +{
> + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
> + int posItem = pos->item;
> + int noSetItem,
> + step;
> + bool allbatchitemvisible;
> + BlockNumber curvmheapblkno = InvalidBlockNumber;
> + uint8 curvmheapblkflags = 0;
> +
> + Assert(hbatch == heap_batch_data(batch, scan));
> +
> + /*
> + * We better still have the TID recycling interlock (generally a pin on
> + * the batch's index page) -- amunguardbatch has not been called yet
> + */
> + Assert(!scan->batchImmediateUnguard);
> +
> + /* Determine the range of items to set visibility for */
> + if (ScanDirectionIsForward(direction))
> + {
> + noSetItem = Min(batch->lastItem + 1, posItem + hscan->xs_vm_items);
> + allbatchitemvisible = noSetItem > batch->lastItem &&
> + (posItem == batch->firstItem ||
> + (hbatch->visInfo[batch->firstItem] & HEAP_BATCH_VIS_CHECKED));
> + step = 1;
> + }
> + else
> + {
> + noSetItem = Max(batch->firstItem - 1, posItem - hscan->xs_vm_items);
> + allbatchitemvisible = noSetItem < batch->firstItem &&
> + (posItem == batch->lastItem ||
> + (hbatch->visInfo[batch->lastItem] & HEAP_BATCH_VIS_CHECKED));
> + step = -1;
> + }
> +
> + /*
> + * Set visibility info for a range of items, in scan order.
> + *
> + * noSetItem is the first item (in the given scan direction) that won't be
> + * set during this call. noSetItem often points to just past the end of
> + * (or just before the start of) the batch's 'items' array.
> + *
> + * We iterate this way to avoid the need for 2 direction-specific loops,
> + * since this is a hot code path that's sensitive to code size increases.
> + */
I'm surprised this is better than two loops, tbh. It makes the actual looping
code more complicated, with higher register pressure.
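For comparison, the direction-agnostic shape boils down to something like this minimal sketch (count_range is a hypothetical stand-in for the visibility-setting loop):

```c
#include <assert.h>

/*
 * One loop covers both scan directions by stepping +1 or -1 toward an
 * exclusive bound (noSetItem), instead of two direction-specific loops.
 */
static int
count_range(const int *flags, int posItem, int noSetItem, int step)
{
	int			n = 0;

	for (int item = posItem; item != noSetItem; item += step)
		n += flags[item];
	return n;
}
```

The two-loop version would let each loop use a constant stride, which the compiler can usually do more with - that's the core of the register pressure concern.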
> +/* ----------------
> + * heapam_batch_getnext - get the next batch of TIDs from a scan
> + *
> + * Called when we need to load the next batch of index entries to process in
> + * the given direction. Caller passes us a batch and a batch position, which
> + * has just been used to read all items from the batch in the direction passed
> + * by caller.
> + *
> + * Returns the next batch to be processed by the index scan, or NULL when
> + * there are no more matches in the given scan direction. Does not advance
> + * caller's batch position; that is left up to caller.
> + *
> + * This is also where batches are appended to the scan's ring buffer. We
> + * don't free any batches here, though; that is also left up to caller.
> + * ----------------
> + */
> +static pg_attribute_hot IndexScanBatch
> +heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction,
> + IndexScanBatch priorBatch, BatchRingItemPos *pos)
> +{
Kinda seems like nothing (except the one assert below) in here is heap specific?
I.e. that every table AM will need pretty exactly this? Seems like it ought
to be somewhere where it can be reused. Could be a static inline in an
indexbatch.h file or such if inlining were desirable.
> +/* ----------------
> + * heapam_batch_getnext_tid - get next TID from batch ring buffer
> + *
> + * Get the next TID from the scan's batch ring buffer, when moving in the
> + * given scan direction.
> + * ----------------
> + */
> +static pg_attribute_hot pg_attribute_always_inline ItemPointer
> +heapam_batch_getnext_tid(IndexScanDesc scan, IndexFetchHeapData *hscan,
> + ScanDirection direction, bool *all_visible)
> +{
> + BatchRingBuffer *batchringbuf = &scan->batchringbuf;
> + BatchRingItemPos *scanPos = &batchringbuf->scanPos;
> + IndexScanBatch scanBatch = NULL;
> +
> + Assert(!scanPos->valid || batchringbuf->headBatch == scanPos->batch);
> + Assert(scanPos->valid || index_scan_batch_count(scan) == 0);
> + Assert(all_visible == NULL || scan->xs_want_itup);
> +
> + /*
> + * Check if there's an existing loaded scanBatch for us to return the next
> + * matching item's TID/index tuple from
> + */
> + if (scanPos->valid)
> + {
> + /*
> + * scanPos is valid, so scanBatch must already be loaded in batch ring
> + * buffer. We rely on that here.
> + */
> + Assert(batchringbuf->headBatch == scanPos->batch);
> +
> + scanBatch = index_scan_batch(scan, scanPos->batch);
> +
> + if (index_scan_pos_advance(direction, scanBatch, scanPos))
> + return heapam_batch_return_tid(scan, hscan, direction,
> + scanBatch, scanPos,
> + all_visible);
> + }
> +
> + /*
> + * Either ran out of items from our existing scanBatch, or it hasn't been
> + * loaded yet (because this is the first call here for the entire scan).
> + * Try to advance scanBatch to the next batch (or get the first batch).
> + */
> + scanBatch = heapam_batch_getnext(scan, direction, scanBatch, scanPos);
> +
> + if (!scanBatch)
> + {
> + /*
> + * We're done; no more batches in the current scan direction.
> + *
> + * Note: scanPos is generally still valid at this point. The scan
> + * might still back up in the other direction.
> + */
> + return NULL;
> + }
> +
> + /*
> + * Advanced scanBatch. Now position scanPos to the start of new
> + * scanBatch.
> + */
> + index_scan_pos_nextbatch(direction, scanBatch, scanPos);
> + Assert(index_scan_batch(scan, scanPos->batch) == scanBatch);
> +
> + /*
> + * Remove the head batch from the batch ring buffer (except when this new
> + * scanBatch is our only one)
> + */
> + if (batchringbuf->headBatch != scanPos->batch)
> + {
> + IndexScanBatch headBatch = index_scan_batch(scan,
> + batchringbuf->headBatch);
> +
> + /* free obsolescent head batch (unless it is scan's markBatch) */
> + tableam_util_free_batch(scan, headBatch);
> +
> + /* Remove the batch from the ring buffer */
> + batchringbuf->headBatch++;
> + }
> +
> + /* In practice scanBatch will always be the ring buffer's headBatch */
> + Assert(batchringbuf->headBatch == scanPos->batch);
> +
> + return heapam_batch_return_tid(scan, hscan, direction, scanBatch, scanPos,
> + all_visible);
> +}
> +
A lot of this also seems like it should be generic code. Where it actually
starts to be heapam specific is in heapam_batch_return_tid() - and there's two
calls to that. So maybe there should be generic helper for the bulk of
heapam_batch_getnext_tid() that's called by the heapam version, which either
returns a NULL upwards or does a single heapam_batch_return_tid().
> +/* ----------------
> + * heapam_index_fetch_heap - get the scan's next heap tuple
> + *
> + * The result is a visible heap tuple associated with the index TID most
> + * recently fetched by our caller in scan->xs_heaptid, or NULL if no more
> + * matching tuples exist. (There can be more than one matching tuple because
> + * of HOT chains, although when using an MVCC snapshot it should be impossible
> + * for more than one such tuple to exist.)
> + *
> + * On success, the buffer containing the heap tup is pinned. The pin must be
> + * dropped elsewhere.
> + * ----------------
> + */
> +static pg_attribute_hot pg_attribute_always_inline bool
From what I can tell it doesn't make sense to specify both pg_attribute_hot
and pg_attribute_always_inline, as the pg_attribute_hot seems to be
disregarded once the function is inlined into the caller.
> +heapam_index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot,
> + bool *heap_continue, bool usebatchring)
> +{
> + bool all_dead = false;
> + bool found;
> +
> + found = heapam_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
> + scan->xs_snapshot, slot,
> + heap_continue, &all_dead);
> +
> + if (found)
> + pgstat_count_heap_fetch(scan->indexRelation);
> +
> + /*
> + * If we scanned a whole HOT chain and found only dead tuples, remember it
> + * for later. We do not do this when in recovery because it may violate
> + * MVCC to do so. See comments in RelationGetIndexScan().
> + */
> + if (!scan->xactStartedInRecovery)
> + {
> + if (usebatchring)
> + {
> + if (all_dead)
> + tableam_util_kill_scanpositem(scan);
> + }
> + else
> + {
> + /*
> + * Tell amgettuple-based index AM to kill its entry for that TID
> + * (this will take effect in the next call, in index_getnext_tid)
> + */
> + scan->kill_prior_tuple = all_dead;
> + }
> + }
> +
> + return found;
> +}
> +/* ----------------
> + * heapam_index_getnext_slot - plain index scan, amgettuple path
> + *
> + * The result is true if a tuple satisfying the scan keys and the snapshot was
> + * found, false otherwise. The tuple is stored in the specified slot.
> + *
> + * On success, resources (like buffer pins) are likely to be held, and will be
> + * dropped by a future call here (or by a later call to index_endscan).
> + *
> + * Note: caller must check scan->xs_recheck, and perform rechecking of the
> + * scan keys if required. We do not do that here because we don't have
> + * enough information to do it efficiently in the general case.
> + * ----------------
> + */
> +static pg_attribute_hot bool
> +heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
> + TupleTableSlot *slot)
> +{
> + ItemPointer tid;
> + bool heap_continue = false;
> +
> + for (;;)
> + {
> + if (!heap_continue)
> + {
> + tid = index_getnext_tid(scan, direction);
> +
> + /* If we're out of index entries, we're done */
> + if (tid == NULL)
> + break;
> + }
> +
> + /*
> + * Fetch the next (or only) visible heap tuple for this index entry.
> + * If we don't find anything, loop around and grab the next TID from
> + * the index.
> + */
> + Assert(ItemPointerIsValid(&scan->xs_heaptid));
> + if (heapam_index_fetch_heap(scan, slot, &heap_continue, false))
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/* ----------------
> + * heapam_index_batch_getnext_slot - plain index scan, amgetbatch path
> + *
> + * Like heapam_index_getnext_slot, but uses the batch ring buffer to get TIDs
> + * from the index AM instead of amgettuple.
> + * ----------------
> + */
> +static pg_attribute_hot bool
> +heapam_index_batch_getnext_slot(IndexScanDesc scan, ScanDirection direction,
> + TupleTableSlot *slot)
> +{
> + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
> + ItemPointer tid;
> + bool heap_continue = false;
> +
> + Assert(!scan->xs_want_itup);
> +
> + for (;;)
> + {
> + if (!heap_continue)
> + {
> + tid = heapam_batch_getnext_tid(scan, hscan, direction, NULL);
> +
> + /* If we're out of index entries, we're done */
> + if (tid == NULL)
> + break;
> + }
> +
> + /*
> + * Fetch the next (or only) visible heap tuple for this index entry.
> + * If we don't find anything, loop around and grab the next TID from
> + * the index.
> + */
> + Assert(ItemPointerIsValid(&scan->xs_heaptid));
> + if (heapam_index_fetch_heap(scan, slot, &heap_continue, true))
> + return true;
> + }
> +
> + return false;
> +}
Have you thought about implementing the different variants of this by having
one common always-inline implementation that is called by the different
variants, with constant arguments to decide between e.g. index_getnext_tid()
and heapam_batch_getnext_tid()?
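Sketch of what I mean, with dummy helpers standing in for index_getnext_tid() and the batch ring path; the compile-time-constant flag arguments let the branches fold away after inlining (using the attribute that pg_attribute_always_inline wraps):

```c
#include <assert.h>

static int	gettuple_calls;
static int	getbatch_calls;

static int
next_from_gettuple(void)
{
	gettuple_calls++;
	return 1;
}

static int
next_from_batchring(void)
{
	getbatch_calls++;
	return 1;
}

/* one always-inline common body, parameterized by constant flags */
static inline __attribute__((always_inline)) int
getnext_slot_impl(int usebatchring, int index_only)
{
	int			found;

	if (usebatchring)
		found = next_from_batchring();
	else
		found = next_from_gettuple();

	if (index_only && found)
		found = 2;				/* e.g. skip the table fetch entirely */
	return found;
}

/* the four specialized entry points; each compiles to straight-line code */
static int getnext_plain_tuple(void) { return getnext_slot_impl(0, 0); }
static int getnext_plain_batch(void) { return getnext_slot_impl(1, 0); }
static int getnext_ionly_tuple(void) { return getnext_slot_impl(0, 1); }
static int getnext_ionly_batch(void) { return getnext_slot_impl(1, 1); }
```

That keeps the four variants from drifting apart while still getting the specialization benefits.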
I find it pretty hard to keep the flow below
heapam_index_[only_][batch_]getnext_slot() straight.
- heapam_index_fetch_heap()
- heapam_index_fetch_tuple()
- heapam_batch_return_tid()
- heapam_batch_getnext()
- heapam_batch_getnext_tid()
Maybe I'm just weak brained today, but from the naming I have no idea what
the difference between e.g. heapam_index_fetch_heap() and
heapam_index_fetch_tuple() could be; similarly, the difference between
heapam_batch_return_tid() and heapam_batch_getnext_tid() is unclear, and
heapam_batch_getnext() isn't particularly clear either.
Later there also is
- heapam_getnext_stream
I also suspect it'd be worth moving this new code into a new file,
heapam_index.c or such.
> IndexScanDesc
> index_beginscan(Relation heapRelation,
> Relation indexRelation,
> + bool xs_want_itup,
> Snapshot snapshot,
> IndexScanInstrumentation *instrument,
> int nkeys, int norderbys)
xs_want_itup in a generic routine seems too low-level. The xs_ stuff is just
the struct prefix that was once uniformly used for IndexScanDescData members;
it doesn't seem to belong in a parameter name.
Could it just be index_only_scan or such?
> @@ -281,7 +280,23 @@ index_beginscan(Relation heapRelation,
> */
> scan->heapRelation = heapRelation;
> scan->xs_snapshot = snapshot;
> + scan->MVCCScan = IsMVCCLikeSnapshot(snapshot);
Elsewhere you have an XXX comment about making sure the snapshot is
pushed/registered - seems it should be here, not there...
> @@ -612,7 +644,23 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
> */
> scan->heapRelation = heaprel;
> scan->xs_snapshot = snapshot;
> + scan->MVCCScan = IsMVCCLikeSnapshot(snapshot);
> scan->instrument = instrument;
> + scan->xs_want_itup = xs_want_itup;
> + scan->batchImmediateUnguard = (scan->MVCCScan && !xs_want_itup);
> +
> + if (indexrel->rd_indam->amgetbatch != NULL)
> + index_batchscan_init(scan);
> +
> + /* Resolve which getnext_slot implementation to use for this scan */
> + if (xs_want_itup)
> + scan->xs_getnext_slot = scan->usebatchring ?
> + heaprel->rd_tableam->index_only_batch_getnext_slot :
> + heaprel->rd_tableam->index_only_getnext_slot;
> + else
> + scan->xs_getnext_slot = scan->usebatchring ?
> + heaprel->rd_tableam->index_batch_getnext_slot :
> + heaprel->rd_tableam->index_getnext_slot;
Any reason this isn't in index_beginscan_internal(), given both
index_beginscan() and index_beginscan_parallel() need it? I realize you'd
need to add arguments to index_beginscan_internal(), but I don't see a problem
with that. Alternatively a helper for this seems like a possibility too.
> /* ----------------
> - * index_getnext_tid - get the next TID from a scan
> + * index_getnext_tid - amgettuple interface
> *
> * The result is the next TID satisfying the scan keys,
> * or NULL if no more matching tuples exist.
> + *
> + * This should only be called by table AM's index_getnext_slot implementation,
> + * and only given an index AM that supports the single-tuple amgettuple
> + * interface.
> * ----------------
Seems like this comment should also be turned into assertions?
> +void
> +index_batchscan_mark_pos(IndexScanDesc scan)
> +{
> + else
> + {
> + /*
> + * It looks like markBatch is loaded/still needed within batchringbuf.
> + *
> + * index_scan_batch_loaded indicates that markpos->batch is loaded
> + * already, but we cannot fully trust it here. It's just about
> + * possible that markpos->batch falls within a since-recycled range of
> + * batch offset numbers (following uint8 overflow).
> + * Make sure that markBatch really is loaded by directly comparing it
> + * against all loaded batches. We must not fail to release markBatch
> + * when nobody else will later on.
> + *
> + * Note: in practice we're very unlikely to end up here. It is very
> + * atypical for an index scan on the inner side of a merge join to
> + * hold on to a mark that trails the current scanBatch this much.
> + */
Hm. Would it be worth using a much wider ring position to avoid this kind of
danger?
There aren't a lot of instances of the batch positions, so it'd not be a lot
of wasted memory. And index_scan_batch() already does
[idx & (INDEX_SCAN_MAX_BATCHES - 1)], which should be just as cheap with
64 bit as with 8 bit (on the systems we care about, at least).
I don't think it'd really be an advantage runtime wise, but it seems like it
might be a bit less fragile that way.
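With wider counters, "is markBatch still loaded" collapses to a plain range comparison - no ring scan needed to rule out overflow. A sketch (hypothetical names, 64-bit batch numbers that never wrap in practice):

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_SCAN_MAX_BATCHES 64

/*
 * With 64-bit batch numbers, whether a remembered batch number is still in
 * [headBatch, nextBatch) is unambiguous.
 */
static inline int
batch_still_loaded(uint64_t headBatch, uint64_t nextBatch, uint64_t markBatch)
{
	return markBatch >= headBatch && markBatch < nextBatch;
}

static inline unsigned
batch_slot(uint64_t batchno)
{
	/* same masking as with uint8; the AND is just as cheap on 64-bit */
	return (unsigned) (batchno & (INDEX_SCAN_MAX_BATCHES - 1));
}
```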
> + freeMarkBatch = true; /* i.e. index_scan_batch_loaded lied to us */
> +
> + for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++)
> + {
> + if (index_scan_batch(scan, i) == markBatch)
> + {
> + /* index_scan_batch_loaded was right/no overflow happened */
> + freeMarkBatch = false;
> + break;
> + }
> + }
> + }
> +
> + if (freeMarkBatch)
> + {
> + /* Free markBatch, since it isn't loaded/needed for batchringbuf */
> + batchringbuf->markBatch = NULL; /* else call won't free markBatch */
> + tableam_util_free_batch(scan, markBatch);
> + }
> +
> + /* copy the scan's position */
> + batchringbuf->markPos = *scanPos;
> + batchringbuf->markBatch = scanBatch;
> +}
> +
> +/*
> + * Restore mark to scanPos position
> + *
> + * Restores the scan to a position saved by index_batchscan_mark_pos earlier.
> + * The scan's markPos becomes its scanPos. The marked batch is restored as
> + * the current scanBatch when needed.
> + *
> + * We just discard all batches (other than markBatch/restored scanBatch),
> + * except when markBatch is already the scan's current scanBatch.
> + */
> +void
> +index_batchscan_restore_pos(IndexScanDesc scan)
> +{
> + BatchRingBuffer *batchringbuf = &scan->batchringbuf;
> + BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos;
> + BatchRingItemPos *markPos = &batchringbuf->markPos;
> + IndexScanBatch markBatch = batchringbuf->markBatch;
> + IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch);
> +
> + Assert(scan->MVCCScan);
> + Assert(scan->xs_heapfetch);
> + Assert(markPos->valid);
> +
> + if (scanBatch == markBatch)
> + {
> + /* markBatch is already scanBatch; needn't change batchringbuf */
> + Assert(scanPos->batch == markPos->batch);
> +
> + scanPos->item = markPos->item;
> + return;
> + }
> +
> + /*
> + * markBatch is behind scanBatch, and so must not be saved in ring buffer
> + * anymore. We have to deal with restoring the mark the hard way: by
> + * invalidating all other loaded batches. This is similar to the case
> + * where the scan direction changes and the scan actually crosses
> + * batch/index page boundaries (see tableam_util_batch_dirchange).
> + *
> + * First, free all batches that are still in the ring buffer.
> + */
> + for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++)
> + {
> + IndexScanBatch batch = index_scan_batch(scan, i);
> +
> + Assert(batch != markBatch);
> +
> + tableam_util_free_batch(scan, batch);
> + }
Couldn't it be that markBatch is one behind scanBatch, in which case we wouldn't
need to throw out all the prefetched batches (in the future commits that do
prefetching)? My understanding is that it'd be rather likely that, if markPos
isn't in the current batch, it'd be in the immediately preceding one.
> +/*
> + * Handle cross-batch change in scan direction
> + *
> + * Called by table AM when its scan changes direction in a way that
> + * necessitates backing the scan up to an index page originally associated
> + * with a now-freed batch.
> + *
> + * When we return, batchringbuf will only contain one batch (the current
> + * headBatch/scanBatch). Caller can then safely pass this batch to amgetbatch
> + * to determine which batch comes next in the new scan direction. From that
> + * point on batchringbuf will look as if our new scan direction had been used
> + * from the start. This approach isn't particularly efficient, but it works
> + * well enough for what ought to be a relatively rare occurrence.
> + */
Do we have decent test coverage of queries changing scan direction? Seems like
something that could quite easily have bugs.
> +void
> +tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch)
> +{
> + /* don't free caller's batch if it is scan's current markBatch */
> + if (batch == scan->batchringbuf.markBatch)
> + return;
> +
> + /* Drop TID recycling interlock (e.g., buffer pin) via amunguardbatch */
> + if (!scan->batchImmediateUnguard)
> + tableam_util_unguard_batch(scan, batch);
> +
> + /*
> + * Let the index AM set LP_DEAD bits in the index page, if applicable.
> + *
> + * batch.deadItems[] is now in whatever order the scan returned items in.
> + * We might have even saved the same item/TID twice.
> + *
> + * Sort and unique-ify deadItems[]. That way the index AM can safely
> + * assume that items will always be in their original index page order.
> + */
> + if (batch->numDead > 0 &&
> + scan->indexRelation->rd_indam->amkillitemsbatch != NULL)
> + {
> + if (batch->numDead > 1)
> + {
> + qsort(batch->deadItems, batch->numDead, sizeof(int),
> + batch_compare_int);
> + batch->numDead = qunique(batch->deadItems, batch->numDead,
> + sizeof(int), batch_compare_int);
> + }
> +
> + scan->indexRelation->rd_indam->amkillitemsbatch(scan, batch);
> + }
> +
> + /*
> + * Use cache, just like indexam_util_batch_release does it (unless scan is
> + * shutting down)
> + */
Any reason to not use indexam_util_batch_release() to implement the release?
Seems nicer to not have two copies of this code.
> + if (scan->xs_heapfetch)
> + {
> + for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
> + {
> + if (scan->batchringbuf.cache[i] == NULL)
> + {
> + /* found empty slot, we're done */
> + scan->batchringbuf.cache[i] = batch;
> + return;
> + }
> + }
> + }
> +
> + if (batch->deadItems)
> + pfree(batch->deadItems);
> + pfree(batch_alloc_base(batch, scan));
> +}
> @@ -1078,13 +1081,19 @@ _bt_unlockbuf(Relation rel, Buffer buf)
> * Buffer is pinned and locked, which means that it is expected to be
> * defined and addressable. Check that proactively.
> */
> - VALGRIND_CHECK_MEM_IS_DEFINED(BufferGetPage(buf), BLCKSZ);
> +#if defined(USE_VALGRIND)
> + Page page = BufferGetPage(buf);
> +
> + VALGRIND_CHECK_MEM_IS_DEFINED(page, BLCKSZ);
> +#endif
>
> /* LockBuffer() asserts that pin is held by this backend */
> LockBuffer(buf, BUFFER_LOCK_UNLOCK);
>
> +#if defined(USE_VALGRIND)
> if (!RelationUsesLocalBuffers(rel))
> - VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(buf), BLCKSZ);
> + VALGRIND_MAKE_MEM_NOACCESS(page, BLCKSZ);
> +#endif
> }
Is this a leftover from a larger change earlier? Because otherwise it's not
obvious why this changed as part of this commit.
> @@ -119,78 +112,11 @@ IndexOnlyNext(IndexOnlyScanState *node)
> /*
> * OK, now that we have what we need, fetch the next tuple.
> */
> - while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
> + while (table_index_getnext_slot(scandesc, direction,
> + node->ioss_TableSlot))
> {
> - bool tuple_from_heap = false;
> -
> CHECK_FOR_INTERRUPTS();
This looks so much better than before...
> --- a/src/backend/optimizer/path/indxpath.c
> +++ b/src/backend/optimizer/path/indxpath.c
> @@ -43,7 +43,7 @@
> /* Whether we are looking for plain indexscan, bitmap scan, or either */
> typedef enum
> {
> - ST_INDEXSCAN, /* must support amgettuple */
> + ST_INDEXSCAN, /* must support amgettuple or amgetbatch */
> ST_BITMAPSCAN, /* must support amgetbitmap */
> ST_ANYSCAN, /* either is okay */
> } ScanTypeControl;
> @@ -747,7 +747,7 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
> {
> IndexPath *ipath = (IndexPath *) lfirst(lc);
>
> - if (index->amhasgettuple)
> + if (index->amhasgetbatch)
> add_path(rel, (Path *) ipath);
>
> if (index->amhasgetbitmap &&
> @@ -835,7 +835,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
> switch (scantype)
> {
> case ST_INDEXSCAN:
> - if (!index->amhasgettuple)
> + if (!index->amhasgetbatch)
> return NIL;
> break;
> case ST_BITMAPSCAN:
Hm - I don't know this code at all, but I thought the old amhasgettuple
approach was supposed to transparently work as before (with perhaps an
exception for mark/restore)?
...
Ah:
> diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
> index b2fbd6a08..665ddca53 100644
> --- a/src/backend/optimizer/util/plancat.c
> +++ b/src/backend/optimizer/util/plancat.c
> @@ -310,11 +310,11 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
> info->amsearcharray = amroutine->amsearcharray;
> info->amsearchnulls = amroutine->amsearchnulls;
> info->amcanparallel = amroutine->amcanparallel;
> - info->amhasgettuple = (amroutine->amgettuple != NULL);
> + info->amhasgetbatch = (amroutine->amgetbatch != NULL ||
> + amroutine->amgettuple != NULL);
Setting amhasgetbatch for an AM that doesn't actually have amgetbatch seems
pretty darn confusing. Why?
> --- a/src/backend/replication/logical/relation.c
> +++ b/src/backend/replication/logical/relation.c
> @@ -887,10 +887,12 @@ IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap)
> return false;
>
> /*
> - * The given index access method must implement "amgettuple", which will
> - * be used later to fetch the tuples. See RelationFindReplTupleByIndex().
> + * The given index access method must implement "amgettuple" or
> + * "amgetbatch", which will be used later to fetch the tuples. See
> + * RelationFindReplTupleByIndex().
> */
> - if (GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgettuple == NULL)
> + if (GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgettuple == NULL &&
> + GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgetbatch == NULL)
> return false;
>
> return true;
I don't think this should do two GetIndexAmRoutineByAmId() calls for the same
AM.
> --- a/src/backend/utils/adt/selfuncs.c
> +++ b/src/backend/utils/adt/selfuncs.c
Not looking at these changes, since they're afaict going to be refactored
further in a later commit. I guess we'll somehow have to resolve the ordering
for actual merging...
> @@ -102,7 +102,6 @@
> #include "access/gin.h"
> #include "access/table.h"
> #include "access/tableam.h"
> -#include "access/visibilitymap.h"
> #include "catalog/pg_collation.h"
> #include "catalog/pg_operator.h"
> #include "catalog/pg_statistic.h"
Yay.
I'll try to read through the rest soon. But now: Food.
Greetings,
Andres