| From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Subject: | Re: index prefetching |
| Date: | 2026-04-05 05:37:02 |
| Message-ID: | CAH2-Wzm9r6sCDVkXHXLobQFE3WcEPf-pm=g6S93u4W9JLK2VDA@mail.gmail.com |
| Lists: | pgsql-hackers |
Attached is v23, which addresses your (Andres') most recent review
feedback. I respond to the points that you brought up in your review
below.
Andres just pushed numerous related read stream patches (thanks for
your work on that!), so I'm down to only 8 patches. This includes a
new version of "Allow read_stream_reset() to not wait for IO
completion" (which uses a less "hacky" approach than previous versions
of that patch).
Note that I reordered the patches for v23, to put the more committable
ones first. I already committed 4 of the v22 patches (that I myself
authored) that were deemed ready earlier today.
The patches I'm posting now are also in a slightly different order to
the one used by prior revisions: they put the hash index patch
earlier. I think it makes sense to commit the hash index patch
immediately after the main amgetbatch commit gets in. Prefetching
itself depends on at least one more pending patch from you, so that
might well need to go in last.
On Sat, Apr 4, 2026 at 1:21 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Subject: [PATCH v21 01/15] Rename heapam_index_fetch_tuple argument for
> > clarity.
> > Subject: [PATCH v21 02/15] Move heapam_handler.c index scan code to new file.
> Any reason to hold off pushing these?
Nope, no reason. Pushed those 2, plus the "Track heap block in
IndexFetchHeapData" and "Keep buffer pins" patches.
> > Index-only scan callers pass table_index_getnext_slot a TupleTableSlot
> > (which the table AM needs internally for heap fetches),
>
> FWIW, that parenthetical does not convince me. Storing tuples in a slot is
> more expensive than just doing so internally in a non-generic datastructure.
I changed the wording from "needs" to "uses". Was that all you meant?
It's unclear if you're asking (or at least suggesting) that I should
invent a new table_index_getnext_slot-like shim, used only during
index-only scans, that doesn't require passing a slot (but does
require getting a table slot some other way, likely at scan startup).
I can see the argument for that. I've only avoided it because it would
require adding more mechanism to the generic index_beginscan path
(though I guess we could invent a new index_beginscan_indexonly
similar to index_beginscan_bitmap).
> At first I didn't like very much that now the tableam is responsible for
> determining this, but thinking about it more, it makes sense - it very well
> could be that a tableam could benefit from choosing a xs_getnext_slot
> implementation depending on things only it itself knows.
Cool.
> I'd strongly prefer forward declaring
> typedef struct IndexScanDescData *IndexScanDesc;
> over including genam.h into tableam.h
Fixed.
> > - struct IndexFetchTableData *(*index_fetch_begin) (Relation rel, uint32 flags);
> > + struct IndexFetchTableData *(*index_fetch_begin) (IndexScanDesc scan,
> > + uint32 flags);
>
> I kinda wonder if this should look roughly like this instead:
>
> /* size of the tableam's IndexFetchTableData */
> size_t index_fetch_state_size;
> void *(*index_fetch_begin) (Relation rel, IndexScanDesc scan, IndexFetchTableData *table_fetch, uint32 flags);
>
> Moving the allocation of the IndexFetchTableData to the caller. The advantage
> would be that this would allow us to later combine the allocation of the
> IndexScanDesc with the IndexFetchTableData, without again needing to change
> the tableam interface.
Hmmm. So index_fetch_state_size is a field in the TableAmRoutine
struct, that is set statically by the table AM?
> s/, as needed when fetching tuples for/as part of/
Fixed.
> I'd just say "Release resources and deallocate the IndexFetchTableData in the
> scan.".
Fixed.
> s/don't want to provide their own slot/that just need to check if a tuple is visible/
Fixed.
> This is quite a spectacularly ugly indentation. Not your fault. Still can't
> stand it :)
I was torn between two different neatnik impulses: the desire to have
consistent function prototypes (that all appear in the same order as
their definitions!), versus the desire for acceptable indentation
without any ludicrously long lines.
> I'd probably just reorder the function definitions so you don't need the
> forward declaration for the pg_attribute_always_inline functions.
Okay, sure. I've done it that way: I reordered things such that I was
able to remove only the prototypes for the pg_attribute_always_inline
functions.
> > +static pg_attribute_always_inline bool
> > +heapam_index_fetch_tuple_impl(Relation rel,
> > + IndexFetchHeapData *hscan,
> > + ItemPointer tid,
> > + Snapshot snapshot,
> > + TupleTableSlot *slot,
> > + bool *heap_continue, bool *all_dead)
>
> I'd probably put all of these just after heapam_index_fetch_end(), given they
> are basically part of the implementation of heapam_index_*.
I don't follow.
Note that I renamed heapam_index_fetch_tuple_impl to
heapam_index_fetch_tuple and renamed heapam_index_fetch_heap to
heapam_index_fetch_heap_item in the attached revision. I think that
those names are slightly clearer. The latter function is just a
wrapper around the former that performs some extra steps that are
secondary to fetching a heap tuple.
> Or you just move the definition of
> index_beginscan_internal() to before index_beginscan(), and you don't need the
> prototype anymore. Which imo is nicer, because you don't need to update as
> many places anymore when changing the signature.
Fixed it that way, much like with heapam_indexscan.c prototypes.
> > - /*
> > - * Only MVCC snapshots are supported here, so there should be no
> > - * need to keep following the HOT chain once a visible entry has
> > - * been found. If we did want to allow that, we'd need to keep
> > - * more state to remember not to call index_getnext_tid next time.
> > - */
> > - if (scandesc->xs_heap_continue)
> > - elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
>
> So that's now only an assertion, right? Probably ok...
Yes. But the new assertion is *less* strict than this defensive elog
was: it tolerates heap_continue being set iff a non-MVCC snapshot is
used.
It has to work that way now, I think. The new index-only scan code
cannot assume that an MVCC snapshot will be used in all cases -- we
have the selfuncs.c caller to consider (which uses
SnapshotNonVacuumable) now. Obviously that never affected the old
nodeIndexonlyScan.c code quoted here.
Actually, I don't know if we really *should* treat the selfuncs.c
caller any differently (except as needed to avoid spurious assertion
failures). The existing selfuncs.c handling assumes that the first
heap tuple returned from a HOT chain is always an acceptable tuple to
return to the scan. That is, it doesn't care about later HOT chain
members.
> Given that all the index_beginscan* callers used only one argument per line,
> I'd also do that for the ios=false.
I fixed the affected index_beginscan* callers, by following your
suggested approach.
> > void
> > @@ -92,7 +88,13 @@ heapam_index_fetch_end(IndexScanDesc scan)
> > {
> > IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
> >
> > - heapam_index_fetch_reset(scan);
> > + /* drop pin if there's a pinned heap page */
> > + if (BufferIsValid(hscan->xs_cbuf))
> > + ReleaseBuffer(hscan->xs_cbuf);
> > +
> > + /* drop pin if there's a pinned visibility map page */
> > + if (BufferIsValid(hscan->xs_vmbuffer))
> > + ReleaseBuffer(hscan->xs_vmbuffer);
> >
> > pfree(hscan);
> > }
>
> Personally I'd keep the clearing of the Buffer fields after the release, but I
> do admit that's rank paranoia.
The committed patch didn't do that. I just don't see a point here,
given that we're directly pfree'ing the memory that you're suggesting
needs to be reset.
Besides, if I did this, wouldn't I also have to clear xs_blk by
resetting it to InvalidBlockNumber? That's what it would take to make
it safe if we weren't just pfree'ing the struct right away.
> > This expands the slot-based table AM interface added by commit FIXME
> > from two callbacks (for amgettuple plain and index-only scans) to four,
> > adding amgetbatch variants for both scan types.
> I think this paragraph is a bit out of date.
Fixed.
> > With amgetbatch, a scan-level policy determines whether each batch's
> > index page buffer pin is dropped eagerly by the index AM (for plain
> > scans with an MVCC snapshot, where the snapshot itself prevents TID
> > recycling problems) or retained as an interlock against concurrent TID
> > recycling by VACUUM. The interlock is retained for non-MVCC scans and
>
> s/non-MVCC/plain non-MVCC/
Fixed.
> > This extends the dropPin mechanism added to nbtree by commit 2ed5b87f,
> > and generalizes it to work with all index AMs that support the new
> > amgetbatch interface (LP_DEAD marking of index entries must be performed
>
> It's a bit funny to say that it expands it to all index AMs that support the
> new amgetbatch, given that that's just nbtree as of this commit :)
Plan now is to commit the hash index patch immediately after this one.
Maybe that makes it clearer?
> > +
> > } IndexFetchHeapData;
>
> Probably unintentional trailing empty line in the struct.
Fixed.
> I don't think the table AM implementation should need to know about the
> allocation base of a batch and similar implementation details.
But they have to, if only to be able to implement their own
heapam_index_fetch_batch_init-like function.
> And index_scan_batch_base()'s comment are now a bit too limited:
> Hm. Does all of this actually need to live in heapam.h? Seems like it could
> just be in heapam_indexscan.c.
Agreed. Fixed.
> > + /*
> > + * Theoretically we should set knownEndForward/knownEndBackward to
> > + * false (whichever is used when moving in the opposite direction)
> > + * when this is the scan's first returned batch. We don't bother
> > + * because the index AM should always record that fact in its own
> > + * opaque area. (These fields only exist because we don't want index
> > + * AMs setting _any_ field from any priorbatch that we pass to them.
> > + * Besides, it would be cumbersome for index AMs to keep track of
> > + * which batch is the current amgetbatch call's original priorbatch.)
> > + */
>
> Why aren't we setting them to false?
It offers no practical benefit (no query can possibly read extra index
pages as a result of the omission), and I'm paranoid (perhaps
unreasonably so) about adding cycles here. Especially with pgbench
SELECT style point queries.
> > +/*
> > + * utilities called by index AMs
> > + */
>
> I'd make these section comments a bit more prominent (like e.g. in tableam.h),
> otherwise they're quite hard to distinguish from function comments.
Fixed.
> > +/*
> > + * Data about one batch of items returned by (and passed to) amgetbatch during
> > + * index scans.
> > + *
> > + * Each batch allocation has the following memory layout:
> > + *
> > + * [table AM opaque area] <- fixed-size, -(batch_table_offset) from base
> > + * [index AM opaque area] <- at -(batch_index_opaque_size) from base
> > + * [IndexScanBatchData] <- base pointer, returned by amgetbatch
>
> This conflicts a bit with how e.g. index_scan_batch_base() uses base pointer.
Good point. Fixed in latest version, by using "allocation base"
terminology in these comments.
> > .index_fetch_begin = heapam_index_fetch_begin,
> > + .index_fetch_batch_init = heapam_index_fetch_batch_init,
> > .index_fetch_reset = heapam_index_fetch_reset,
> > .index_fetch_end = heapam_index_fetch_end,
> > + .index_fetch_markpos = heapam_index_fetch_markpos,
> > + .index_fetch_restrpos = heapam_index_fetch_restrpos,
>
> Given that the ->index_fetch_begin() now is responsible for configuring
> IndexScanDesc->xs_getnext_slot, it's not clear to me that it really makes
> sense to have stuff like ->index_fetch_batch_init() be set in TableAmRoutine.
>
> Wonder if we should just have a TableIndexFetchRoutine struct embedded in the
> IndexScanDesc that ->index_fetch_begin needs to fully initialize.
I'm glad that you came around to my point of view on it being a good
idea to allow the table AM to do more in ->index_fetch_begin.
Specifically, allowing that callback to pick its preferred slot makes
sense. But I don't think the same argument I made works for these
other new callbacks, like ->index_fetch_batch_init?
Or were you making another argument?
> What's the point of fetching hscan and then using a (void) hscan to avoid a
> warning?
Fixed in committed patch.
> > + return heapam_index_return_scanpos_tid(scan, hscan, direction,
> > + scanBatch, scanPos, all_visible);
> > +}
>
> Still looks like most of this ought to be in a tableam_util_...
Why?
Code reuse is a virtue, but it isn't the highest virtue. Splitting up
this function is possible but quite awkward.
> > + pgstat_count_index_tuples(scan->indexRelation, 1);
>
> The asymmetry of where pgstat_count_index_tuples() is called (i.e. here for
> amgetbatch(), in index_getnext_tid() for amgettuple) still seems weird to me.
It's also kinda weird that index_getnext_tid can call
table_index_fetch_reset, even though index_getnext_tid is only really
supposed to be called by table AMs now (and is only called by heapam
in practice).
What do you want to do about it? index_getnext_tid isn't adding too
much right now, I'd say.
> > + allbatchitemsvisible = lastSetItem <= batch->firstItem &&
> > + (posItem == batch->lastItem ||
> > + (hbatch->visInfo[batch->lastItem] & HEAP_BATCH_VIS_CHECKED));
> > + }
>
> Might be worth putting into a static inline.
Why do you say that? I don't follow.
> Can't the *2 lead to computing a xs_vm_items bigger than scan->maxitemsbatch?
Yes, it can.
> > +tableam_util_batchscan_end(IndexScanDesc scan)
> > +{
> > + pfree(index_scan_batch_base(scan, cached));
>
> I'd set scan->batchcache[i] to NULL after freeing.
I tried that out, but I *think* it regressed pgbench SELECT by about
~1.5% of total TPS (I'd need more thorough testing to confirm this).
Are you sure we need this?
> > + /*
> > + * index_scan_batch_loaded indicates that markPos->batch is loaded,
> > + * but after uint8 overflow a stale batch offset can alias a
>
> Maybe s/overflow/wraparound/?
Fixed.
> > + * currently-loaded range (false positive). Confirm by checking
> > + * whether the batch pointer in markPos->batch's slot still matches.
> > + */
> > + freeMarkBatch = (index_scan_batch(scan, markPos->batch) != markBatch);
>
> Perhaps this should be be integrated into index_scan_batch_loaded()?
I doubt that would be practical, because there is an important
difference between a batch being immediately available (meaning the
scan can read it from memory without calling amgetbatch again) and a batch
actually being in the ring buffer. markBatch is a special case in that
it can be in the former category without also being in the latter one.
Most callers really don't need to think about that distinction, and it
seems actively bad if index_scan_batch_loaded knows about the scan's
saved markBatch in the context of these callers.
> > +/*
> > + * Restore mark to scanPos position
>
> This sounds a bit like the mark is restored into the position of the scan pos,
> which obviously makes no sense.
Fixed.
> > + if (batch->numDead > 0 &&
> > + scan->indexRelation->rd_indam->amkillitemsbatch != NULL)
> > + {
> > + if (batch->numDead > 1)
> > + {
> > + qsort(batch->deadItems, batch->numDead, sizeof(int),
> > + batch_compare_int);
> > + batch->numDead = qunique(batch->deadItems, batch->numDead,
> > + sizeof(int), batch_compare_int);
> > + }
> > +
> > + scan->indexRelation->rd_indam->amkillitemsbatch(scan, batch);
> > + }
>
> Based on the name or function header one sure wouldn't think this will call
> indexam callbacks...
I dunno, how do you think it should work?
To me it makes sense to call amkillitemsbatch here, because when the
table AM frees a batch it clearly has no further need to perform heap
fetches for any items in that batch. The table-AM-focussed function
name seems fine to me because this is one extra step, that happens to
involve calling the index AM.
Bear in mind that the call to amkillitemsbatch only does what the
table AM instructs. It has no real awareness of any ongoing scan. We
always use the LSN trick there now, so it's really not coordinating
with itself/with the table AM.
> > - scan->kill_prior_tuple = false; /* for safety */
> > scan->xs_heap_continue = false;
> >
> > - scan->indexRelation->rd_indam->amrestrpos(scan);
> > + table_index_fetch_restrpos(scan);
> > }
>
> Did we lose that kill_prior_tuple? Or is there a reason it's not necessary
> anymore? Too tired to think through this.
We still need kill_prior_tuple, but only for amgettuple scans. Those
don't support mark + restore anymore.
--
Peter Geoghegan
| Attachment | Content-Type | Size |
|---|---|---|
| v23-0004-heapam-Optimize-pin-transfers-during-index-scans.patch | application/octet-stream | 6.8 KB |
| v23-0005-heapam-Add-index-scan-I-O-prefetching.patch | application/octet-stream | 48.3 KB |
| v23-0007-aio-Fix-pgaio_io_wait-for-staged-IOs-B.patch | application/octet-stream | 6.3 KB |
| v23-0006-Allow-read_stream_reset-to-not-wait-for-IO-compl.patch | application/octet-stream | 20.5 KB |
| v23-0001-Add-slot-based-table-AM-index-scan-interface.patch | application/octet-stream | 79.6 KB |
| v23-0008-WIP-aio-bufmgr-Fix-race-condition-leading-to-dea.patch | application/octet-stream | 3.1 KB |
| v23-0002-Add-amgetbatch-interface-and-adopt-it-in-nbtree.patch | application/octet-stream | 240.2 KB |
| v23-0003-Adopt-amgetbatch-interface-in-hash-index-AM.patch | application/octet-stream | 48.0 KB |