Quick Links

Re: index prefetching

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Peter Geoghegan <pg(at)bowt(dot)ie>
Cc:	Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject:	Re: index prefetching
Date:	2026-03-31 20:22:36
Message-ID:	chsvntdxvsiyigxq4nng36gne4natvxwvsqnkvbjlpaw6bu7co@a6togdo4wbrj
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2026-03-31 13:39:30 -0400, Peter Geoghegan wrote:
> On Fri, Mar 27, 2026 at 8:35 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > Attached is v18, which addresses the tricky issues I skipped in v17.
>
> v19 is attached. It contains a few notable improvements over v18:
>
> * The first commit (the one that simply moves code into the new
> heapam_indexscan.c file) now moves heap_hot_search_buffer over there
> as well.
>
> Andres suggested this at one point, and I believe it wins us a small
> but significant performance benefit.

To me it also just makes sense.

> From 742dcbea7d184443d2c4ef3c095b60f47f870f72 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Wed, 25 Mar 2026 00:22:17 -0400
> Subject: [PATCH v19 01/18] Move heapam_handler.c index scan code to new file.
>
> Move the heapam index fetch callbacks (index_fetch_begin,
> index_fetch_reset, index_fetch_end, and index_fetch_tuple) into a new
> dedicated file. Also move heap_hot_search_buffer over. This is a
> purely mechanical move with no functional impact.

I don't love that this renames arguments (s/call_again/heap_continue/) while
moving. Makes it harder to verify this is just the mechanical change it claims
to be.

> +/*-------------------------------------------------------------------------
> + *
> + * heapam_indexscan.c
> + * heap table plain index scan and index-only scan code
> + *
> + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *
> + * IDENTIFICATION
> + * src/backend/access/heap/heapam_indexscan.c
> + *
> + *-------------------------------------------------------------------------
> + */
> +#include "postgres.h"
> +
> +#include "access/heapam.h"
> +#include "storage/predicate.h"

Should probably include access/relscan.h for +IndexFetchTableData, rather than
relying on the indirect include.

> From 5ca838ac32a8d43510325cb400be4ecea47ea8d2 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Sun, 22 Mar 2026 02:36:57 -0400
> Subject: [PATCH v19 02/18] Add slot-based table AM index scan interface.
>
> Add table_index_getnext_slot, a new table AM callback that wraps both
> plain and index-only index scans that use amgettuple. Two new
> TableAmRoutine callbacks are introduced -- one for plain scans and one
> for index-only scans -- which an upcoming commit that adds the
> amgetbatch interface will expand to four. The appropriate callback is
> resolved once in index_beginscan, and called through a function pointer
> (xs_getnext_slot) on the IndexScanDesc when the table_index_getnext_slot
> shim function is called from executor nodes.
>
> This moves VM checks for index-only scans out of the executor and into
> heapam, enabling batching of visibility map lookups (though for now we
> continue to just perform retail lookups).

I'd also add that it's architecturally much cleaner this way.

> A small minority of callers (2 callers in total) continue to use the old
> table_index_fetch_tuple interface. This is still necessary with callers
> that fundamentally need to pass a TID to the table AM. Both callers
> perform some kind of constraint enforcement.

I think we may need to do something about this before all of this over. It's
just too confusing to have the generic sounding index_fetch_tuple() interfaces
that are barely ever used and *shouldn't* be used when, uhm, fetching a tuple
pointed to by an index.

At the very least it seems we should rename the ->index_fetch_tuple() to
->index_fetch_tid() or something and remove the call_again, all_dead
arguments, since they are never used in this context. I suspect it should
*not* use the table_index_fetch_begin()/table_index_fetch_end() interface
either.

> XXX Do we still need IndexFetchTableData.rel? We are now mostly using
> IndexScanDescData.heapRelation, but we could use it for absolutely
> everything if we were willing to revise the signature of the old
> table_index_fetch_tuple interface by giving it its own heapRelation arg.

I think we *should* change the argument of table_index_fetch_tuple().

> Index-only scan callers pass table_index_getnext_slot a TupleTableSlot
> (which the table AM needs internally for heap fetches), but continue to
> read their results from IndexScanDescData fields such as xs_itup (rather
> than from the slot itself).

Pretty this ain't. But I guess it's been vaguely gross like this for quite a
while, so...

> This convention lets both plain and index-only scans use the same
> table_index_getnext_slot interface.

Not convinced that's really a useful goal.

> All callers can continue to rely on the scan descriptor's xs_heaptid field
> being set on each call. (execCurrentOf is the only caller that reads this
> field directly, to determine the current row of an index-only scan, but it
> seems like a good idea to do the same thing for all callers).

> XXX Should execCurrentOf continue to do this? Can't it get the same TID
> from the table slot instead?

I think the tid from the slot is always the tid from the start of the HOT
chain, not the actual location of the tuple. That's important, as otherwise a
HOT pruning after determining the slot's tid would potentially make the tid
invalid. Whereas, I think, xs_heaptid, is currently set to the offset of the
actual tuple, even if it's a just a HOT version.

> @@ -177,10 +181,13 @@ typedef struct IndexScanDescData
> struct TupleDescData *xs_hitupdesc; /* rowtype descriptor of xs_hitup */
>
> ItemPointerData xs_heaptid; /* result */
> - bool xs_heap_continue; /* T if must keep walking, potential
> - * further results */
> IndexFetchTableData *xs_heapfetch;
>
> + /* Resolved getnext_slot implementation, set by index_beginscan */
> + bool (*xs_getnext_slot) (struct IndexScanDescData *scan,
> + ScanDirection direction,
> + struct TupleTableSlot *slot);
> +
> bool xs_recheck; /* T means scan keys must be rechecked */
>
> /*

Hm. Why is it ok that we now don't have an equivalent of xs_heap_continue on
the scan level anymore? I see there's a local heap_continue in
heapam_index_getnext_slot(), but isn't that insufficient e.g. for a
SNAPSHOT_ANY snapshot, where all the row versions would be visible?

Am I misunderstanding how this works?

> +/*
> + * Fetch the next tuple from an index scan into `slot`, scanning in the
> + * specified direction. Returns true if a tuple satisfying the scan keys and
> + * the snapshot was found, false otherwise. The tuple is stored in the
> + * specified slot.
> + *
> + * Dispatches through scan->xs_getnext_slot, which is resolved once by
> + * index_beginscan.
> + *
> + * On success, resources (like buffer pins) are likely to be held, and will be
> + * released by a future table_index_getnext_slot or index_endscan call.
> + *
> + * Note: caller must check scan->xs_recheck, and perform rechecking of the
> + * scan keys if required. We do not do that here because we don't have
> + * enough information to do it efficiently in the general case.
> + *
> + * For index-only scans, the callback also fills xs_itup/xs_itupdesc or
> + * xs_hitup/xs_hitupdesc (or both) so that index data can be returned without
> + * a heap fetch.
> + */
> +static inline bool
> +table_index_getnext_slot(IndexScanDesc iscan, ScanDirection direction,
> + TupleTableSlot *slot)
> +{
> + return iscan->xs_getnext_slot(iscan, direction, slot);
> +}

Hm. Does this actually belong in tableam.h? I'm a bit conflicted.

> @@ -2553,6 +2554,8 @@ static const TableAmRoutine heapam_methods = {
> .index_fetch_begin = heapam_index_fetch_begin,
> .index_fetch_reset = heapam_index_fetch_reset,
> .index_fetch_end = heapam_index_fetch_end,
> + .index_plain_amgettuple_getnext_slot = heapam_index_plain_amgettuple_getnext_slot,
> + .index_only_amgettuple_getnext_slot = heapam_index_only_amgettuple_getnext_slot,
> .index_fetch_tuple = heapam_index_fetch_tuple,
>
> .tuple_insert = heapam_tuple_insert,

Wonder if it's perhaps worth deleting _slot and shortening getnext to
next. These are quite only names...

> +/* table_index_getnext_slot callback: amgettuple, plain index scan */
> +pg_attribute_hot bool
> +heapam_index_plain_amgettuple_getnext_slot(IndexScanDesc scan,
> + ScanDirection direction,
> + TupleTableSlot *slot)
> +{
> + Assert(!scan->xs_want_itup);
> + Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
> +
> + return heapam_index_getnext_slot(scan, direction, slot, false);
> +}
> +
> +/* table_index_getnext_slot callback: amgettuple, index-only scan */
> +pg_attribute_hot bool
> +heapam_index_only_amgettuple_getnext_slot(IndexScanDesc scan,
> + ScanDirection direction,
> + TupleTableSlot *slot)
> +{
> + Assert(scan->xs_want_itup);
> + Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
> +
> + return heapam_index_getnext_slot(scan, direction, slot, true);
> +}
> +
> bool
> heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
> ItemPointer tid,
> @@ -233,6 +274,18 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
> TupleTableSlot *slot,
> bool *heap_continue, bool *all_dead)
> {
> +
> + return heapam_index_fetch_tuple_impl(scan, tid, snapshot, slot,
> + heap_continue, all_dead);
> +}
> +
> +static pg_attribute_always_inline bool
> +heapam_index_fetch_tuple_impl(struct IndexFetchTableData *scan,
> + ItemPointer tid,
> + Snapshot snapshot,
> + TupleTableSlot *slot,
> + bool *heap_continue, bool *all_dead)
> +{
> IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
> BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
> bool got_heap_tuple;
> @@ -289,3 +342,172 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>
> return got_heap_tuple;
> }
> +
> +/*
> + * Common implementation for both heapam_index_*_getnext_slot variants.
> + *
> + * The result is true if a tuple satisfying the scan keys and the snapshot was
> + * found, false otherwise. The tuple is stored in the specified slot.
> + *
> + * On success, resources (like buffer pins) are likely to be held, and will be
> + * dropped by a future call here (or by a later call to heapam_index_fetch_end
> + * through index_endscan).
> + *
> + * The index_only parameter is a compile-time constant at each call site,
> + * allowing the compiler to specialize the code for each variant.
> + */
> +static pg_attribute_always_inline bool
> +heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
> + TupleTableSlot *slot, bool index_only)

For heapam_index_fetch_tuple_impl() you added _impl, but here you didn't.
Mild preference for also adding _impl here.

> +{
> + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
> + ItemPointer tid;
> + bool heap_continue = false;

As mentioned above, it's not clear to me that it is correct for all scan types
to have a short-lived heap_continue.

> + bool all_visible = false;
> + BlockNumber last_visited_block = InvalidBlockNumber;
> + uint8 n_visited_pages = 0,
> + xs_visited_pages_limit = 0;
> +
> + if (index_only)
> + xs_visited_pages_limit = scan->xs_visited_pages_limit;

Did you check that it's useful to have xs_visited_pages_limit as a longer
lived variable? I suspect it's just going to add register pressure and will
need to be saved/restored across other calls, making it just as cheap to
always fetch it from scan.

> + for (;;)
> + {
> + if (!heap_continue)
> + {
> + /* Get the next TID from the index */
> + tid = index_getnext_tid(scan, direction);
> +
> + /* If we're out of index entries, we're done */
> + if (tid == NULL)
> + break;
> +
> + /* For index-only scans, check the visibility map */
> + if (index_only)
> + all_visible = VM_ALL_VISIBLE(scan->heapRelation,
> + ItemPointerGetBlockNumber(tid),
> + &hscan->xs_vmbuffer);
> + }
> +
> + Assert(ItemPointerIsValid(&scan->xs_heaptid));
> +
> + if (index_only)
> + {
> + /*
> + * We can skip the heap fetch if the TID references a heap page on
> + * which all tuples are known visible to everybody. In any case,
> + * we'll use the index tuple not the heap tuple as the data
> + * source.
> + */
> + if (!all_visible)
> + {
> + /*
> + * Rats, we have to visit the heap to check visibility.
> + */
> + if (scan->instrument)
> + scan->instrument->ntablefetches++;
> +
> + if (!heapam_index_fetch_heap(scan, slot, &heap_continue))
> + {

Still find it somewhat weird that we have local 'tid' variable, that we do not
use pass to heapam_index_fetch_heap(), because heapam_index_fetch_heap()
relies on index_getnext_tid() having stored the to-be-fetched tid in
xs_heaptid.

But I guess that's something to change at a later date.

> + /*
> + * No visible tuple. If caller set a visited-pages limit
> + * (only selfuncs.c does this), count distinct heap pages
> + * and give up once we've visited too many.
> + */
> + if (unlikely(xs_visited_pages_limit > 0))
> + {
> + BlockNumber blk = ItemPointerGetBlockNumber(tid);
> +
> + if (blk != last_visited_block)
> + {
> + last_visited_block = blk;
> + if (++n_visited_pages > xs_visited_pages_limit)
> + return false; /* give up */
> + }
> + }
> + continue; /* no visible tuple, try next index entry */
> + }
> +
> + ExecClearTuple(slot);

Previously the ExecClearTuple() here had at least this comment:
/* We don't actually need the heap tuple for anything */

Now it looks even more confusing...

> @@ -313,7 +313,32 @@ visibilitymap_set(BlockNumber heapBlk,
> * since we don't lock the visibility map page either, it's even possible that
> * someone else could have changed the bit just before we look at it, but yet
> * we might see the old value. It is the caller's responsibility to deal with
> - * all concurrency issues!
> + * all concurrency issues! In practice it can't be stale enough to matter for
> + * the primary use case: index-only scans that check whether a heap fetch can
> + * be skipped.
> + *
> + * The argument for why it can't be stale enough to matter for the primary use
> + * case is as follows:
> + *
> + * Inserts: we need to detect that a VM bit was cleared by an insert right
> + * away, because the new tuple is present in the index but not yet visible.
> + * Reading the TID from the index page (under a shared lock on the index
> + * buffer) is serialized with the insertion of the TID into the index (under
> + * an exclusive lock on the same index buffer). Because the VM bit is cleared
> + * before the index is updated, and locking/unlocking of the index page acts
> + * as a full memory barrier, we are sure to see the cleared bit whenever we
> + * see a recently-inserted TID.
> + *
> + * Deletes: the clearing of the VM bit by a delete is NOT serialized with the
> + * index page access, because deletes do not update the index page (only
> + * VACUUM removes the index TID). So we may see a significantly stale value.
> + * However, we don't need to detect the delete right away, because the tuple
> + * remains visible until the deleting transaction commits or the statement
> + * ends (if it's our own transaction). In either case, the lock on the VM
> + * buffer will have been released (acting as a write barrier) after clearing
> + * the bit. And for us to have a snapshot that includes the deleting
> + * transaction (making the tuple invisible), we must have acquired
> + * ProcArrayLock after that time, acting as a read barrier.
> */
> uint8
> visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)

Nice. Much better place for the comment.

> diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
> index 1408989c5..acc9f3e6a 100644
> --- a/src/backend/access/index/genam.c
> +++ b/src/backend/access/index/genam.c
> @@ -126,6 +126,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
> scan->xs_hitup = NULL;
> scan->xs_hitupdesc = NULL;
>
> + scan->xs_visited_pages_limit = 0;
> +
> return scan;
> }

Orthogonal: I suspect eventually we should pass the table to
RelationGetIndexScan(), have the tableam return how much space it needs, and
allocate it one chunk.

> From 85be7bf520fc2746f4ee73a0744c7aa04da55e52 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Tue, 10 Mar 2026 14:40:35 -0400
> Subject: [PATCH v19 03/18] heapam: Track heap block in IndexFetchHeapData.

LGTM.

> Subject: [PATCH v19 04/18] heapam: Keep buffer pins across index rescans.
>
> Avoid dropping the heap page pin (xs_cbuf) and visibility map pin
> (xs_vmbuffer) during heapam_index_fetch_reset. Retaining these pins
> saves cycles during tight nested loop joins and merge joins that
> frequently restore a saved mark, since the next tuple fetched after a
> rescan often falls on the same heap page. It can also avoid repeated
> pinning and unpinning of the same buffer when rescans happen to revisit
> the same page.
>
> Preparation for an upcoming patch that will add the amgetbatch
> interface to enable optimizations such as I/O prefetching.
>
> Author: Peter Geoghegan <pg(at)bowt(dot)ie>
> Reviewed-By: Andres Freund <andres(at)anarazel(dot)de>
> Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
> ---
> src/backend/access/heap/heapam_indexscan.c | 27 ++++++++++++----------
> src/backend/access/index/indexam.c | 6 ++---
> 2 files changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
> index 4b5e2b30d..82f135e1e 100644
> --- a/src/backend/access/heap/heapam_indexscan.c
> +++ b/src/backend/access/heap/heapam_indexscan.c
> @@ -59,18 +59,15 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
> {
> IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
>
> - if (BufferIsValid(hscan->xs_cbuf))
> - {
> - ReleaseBuffer(hscan->xs_cbuf);
> - hscan->xs_cbuf = InvalidBuffer;
> - hscan->xs_blk = InvalidBlockNumber;
> - }
> + /* Resets are a no-op (XXX amgetbatch commit resets xs_vm_items here) */
> + (void) hscan;

I assume that's not a line you intend to commit?

LGTM otherwise.

Need to do something else for a while. More another time.

Greetings,

Andres Freund

In response to

Re: index prefetching at 2026-03-31 17:39:30 from Peter Geoghegan

Responses

Re: index prefetching at 2026-04-01 15:51:19 from Peter Geoghegan

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2026-03-31 20:36:32	Re: docs: warn about post-data-only schema dumps with parallel restore.
Previous Message	David G. Johnston	2026-03-31 20:04:21	Re: docs: warn about post-data-only schema dumps with parallel restore.