| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Peter Geoghegan <pg(at)bowt(dot)ie> |
| Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Subject: | Re: index prefetching |
| Date: | 2026-04-05 23:01:53 |
| Message-ID: | 6sphk3ycctmbihlrykts7uj6mjakop6wrq2dhe3vnlmrnldz2f@uuwmkd6jjrxa |
| Lists: | pgsql-hackers |
Hi,
On 2026-04-05 01:37:02 -0400, Peter Geoghegan wrote:
> The patches I'm posting now are also in a slightly different order to
> the one used by prior revisions: they put the hash index commit
> earlier. I think it makes sense to commit the hash index patch
> immediately after the main amgetbatch commit gets in. Prefetching
> itself depends on at least one more pending patch from you, so that
> might well need to go in last.
Makes sense.
> > > Index-only scan callers pass table_index_getnext_slot a TupleTableSlot
> > > (which the table AM needs internally for heap fetches),
> >
> > FWIW, that parenthetical does not convince me. Storing tuples in a slot is
> > more expensive than just doing so internally in a non-generic datastructure.
>
> I changed the wording from "needs" to "uses". Was that all you meant?
It's better than nothing :)
> It's unclear if you're asking (or at least suggesting) that I should
> invent a new table_index_getnext_slot-like shim, used only during
> index-only scans, that doesn't require passing a slot (but does
> require getting a table slot some other way, likely at scan startup).
Why would it require creating a tableam shaped slot internally? It seems
rather silly to store a tuple fetched to verify visibility into a slot where
it is never used. Storing into a slot requires indirection like
IncrBufferRefCount(), which isn't that cheap.
heapam_index_getnext_slot(index_only=>1)
-> heapam_index_fetch_heap_item()
-> heapam_index_fetch_tuple()
right now does heap_hot_search_buffer() and then, if that found something,
stores the tuple in the slot. But there's no reason to do that if the tuple
isn't actually desired by the caller. So there's afaict simply no need to
have the slot.
Just passing index_only down and skipping storing the tuple in the slot in an
IOS seems to work. Passing a NULL slot needs a bit more work, but not a whole
lot.
So I guess yes, I am suggesting that we have a separate
table_index_getnext_ios() or such, which is only usable for index only scans.
> I can see the argument for that. I've only avoided it because it would
> require adding more mechanism to the generic index_beginscan path
> (though I guess we could invent a new index_beginscan_indexonly
> similar to index_beginscan_bitmap).
It probably would be cleaner to have index_beginscan_indexonly(), yea.
Not sure why it would need a meaningful amount of new mechanism for beginscan?
I'd expect just assertions on ->xs_want_itup in the table_index_getnext_ios()
wrapper, and the opposite assertion in table_index_getnext_slot().
I could see passing an index-shaped slot to table_index_getnext_ios(), which
would improve the layering for nodeIndexonlyscan.c further. It's a bit gross
that there's this stuff in IndexOnlyNext()
/*
* Fill the scan tuple slot with data from the index. This might be
* provided in either HeapTuple or IndexTuple format. Conceivably an
* index AM might fill both fields, in which case we prefer the heap
* format, since it's probably a bit cheaper to fill a slot from.
*/
if (scandesc->xs_hitup)
{
/*
* We don't take the trouble to verify that the provided tuple has
* exactly the slot's format, but it seems worth doing a quick
* check on the number of fields.
*/
Assert(slot->tts_tupleDescriptor->natts ==
scandesc->xs_hitupdesc->natts);
ExecForceStoreHeapTuple(scandesc->xs_hitup, slot, false);
}
else if (scandesc->xs_itup)
StoreIndexTuple(node, slot, scandesc->xs_itup, scandesc->xs_itupdesc);
else
elog(ERROR, "no data returned for index-only scan");
and
/*
* We expect that the index will return data in IndexTuple not
* HeapTuple format.
*/
if (!index_scan->xs_itup)
elog(ERROR, "no data returned for index-only scan");
in get_actual_variable_endpoint().
But I'd probably defer that part for now.
> > > - struct IndexFetchTableData *(*index_fetch_begin) (Relation rel, uint32 flags);
> > > + struct IndexFetchTableData *(*index_fetch_begin) (IndexScanDesc scan,
> > > + uint32 flags);
> >
> > I kinda wonder if this should look roughly like this instead:
> >
> > /* size of the tableam's IndexFetchTableData */
> > size_t index_fetch_state_size;
> > void *(*index_fetch_begin) (Relation rel, IndexScanDesc scan, IndexFetchTableData *table_fetch, uint32 flags);
> >
> > Moving the allocation of the IndexFetchTableData to the caller. The advantage
> > would be that this would allow us to later combine the allocation of the
> > IndexScanDesc with the IndexFetchTableData, without again needing to change
> > the tableam interface.
>
> Hmmm. So index_fetch_state_size is a field in the TableAmRoutine
> struct, that is set statically by the table AM?
Yea.
> > > - /*
> > > - * Only MVCC snapshots are supported here, so there should be no
> > > - * need to keep following the HOT chain once a visible entry has
> > > - * been found. If we did want to allow that, we'd need to keep
> > > - * more state to remember not to call index_getnext_tid next time.
> > > - */
> > > - if (scandesc->xs_heap_continue)
> > > - elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
> >
> > So that's now only an assertion, right? Probably ok...
>
> Yes. But the new assertion is *less* strict than this defensive elog
> was: it tolerates heap_continue being set iff a non-MVCC snapshot is
> used.
>
> It has to work that way now, I think. The new index-only scan code
> cannot assume that an MVCC snapshot will be used in all cases -- we
> have the selfuncs.c caller to consider (which uses
> SnapshotNonVacuumable) now. Obviously that never affected the old
> nodeIndexonlyScan.c code quoted here.
That seems ok to me. If you somehow call this stuff with a non-MVCC snapshot
without knowing what you're doing, well, that's too bad for you.
> Actually, I don't know if we really *should* treat the selfuncs.c
> caller any different (except as needed to avoid spurious assertion
> failures). The existing selfuncs.c handling assumes that the first
> heap tuple returned from a HOT chain is always an acceptable tuple to
> return to the scan. That is, it doesn't care about later HOT chain
> members.
I don't think it's worth baking that assumption into the core
infrastructure.
> > I don't think the table AM implementation should need to know about the
> > allocation base of a batch and similar implementation details.
>
> But they have to, if only to be able to implement their own
> heapam_index_fetch_batch_init-like function.
I don't follow. Of course it needs to know the size of its own private data,
but it doesn't itself need to know how to get from IndexScanBatch to that.
> > > .index_fetch_begin = heapam_index_fetch_begin,
> > > + .index_fetch_batch_init = heapam_index_fetch_batch_init,
> > > .index_fetch_reset = heapam_index_fetch_reset,
> > > .index_fetch_end = heapam_index_fetch_end,
> > > + .index_fetch_markpos = heapam_index_fetch_markpos,
> > > + .index_fetch_restrpos = heapam_index_fetch_restrpos,
> >
> > Given that the ->index_fetch_begin() now is responsible for configuring
> > IndexScanDesc->xs_getnext_slot, it's not clear to me that it really makes
> > sense to have stuff like ->index_fetch_batch_init() be set in TableAmRoutine.
> >
> > Wonder if we should just have a TableIndexFetchRoutine struct embedded in the
> > IndexScanDesc that ->index_fetch_begin needs to fully initialize.
>
> I'm glad that you came around to my point of view on it being a good
> idea to allow the table AM to do more in ->index_fetch_begin.
> Specifically, allowing that callback to pick its preferred slot makes
> sense. But I don't think the same argument I made works for these
> other new callbacks, like ->index_fetch_batch_init?
Yea, I guess it's less likely you'd have multiple implementations of that (I
can come up with some cases where you might want to, but I don't think they're
particularly likely to come up).
> Or were you making another argument?
I guess I was just a bit saddened by
indexam_util_batch_alloc()
->table_index_fetch_batch_init()
having to do
scan->heapRelation->rd_tableam->index_fetch_batch_init(scan, batch, new_alloc);
that's a lot of indirection to go through.
Which is what then made me think of embedding something like a
TableIndexFetchRoutine (with batch_init, reset, markpos, restrpos, end
members) in IndexScanDesc. That way table_index_{fetch_reset,batch_init,...}()
wouldn't need this level of indirection.
> > > + return heapam_index_return_scanpos_tid(scan, hscan, direction,
> > > + scanBatch, scanPos, all_visible);
> > > +}
> >
> > Still looks like most of this ought to be in a tableam_util_...
>
> Why?
>
> Code reuse is a virtue, but it isn't the highest virtue. Splitting up
> this function is possible but quite awkward.
Not having the same code in multiple table AMs would make it easier to write
them, which would be nice in itself, but I am more concerned with ending up
with copies of heapam_index_getnext_scanbatch_pos() in various AMs, us finding
problems with heapam_index_getnext_scanbatch_pos() in a minor release, and
having a harder time adjusting things due to the copied code.
> > > + pgstat_count_index_tuples(scan->indexRelation, 1);
> >
> > The asymmetry of where pgstat_count_index_tuples() is called (i.e. here for
> > amgetbatch(), in index_getnext_tid() for amgettuple) still seems weird to me.
>
> It's also kinda weird that index_getnext_tid can call
> table_index_fetch_reset, even though index_getnext_tid is only really
> supposed to be called by table AMs now (and is only called by heapam
> in practice).
Hm.
> What do you want to do about it? index_getnext_tid isn't adding too
> much right now, I'd say.
It indeed doesn't add a whole lot.
I think we should at least rename it; if there are extensions calling it, they
should probably not continue to do so.
I would move it into a tableam_util_fetch_next_tid() (or such) helper, I
think. That also makes it easier to get rid of the table_index_fetch_reset()
and move the pgstat_count_index_tuples(scan->indexRelation, 1) into one place.
> > > + allbatchitemsvisible = lastSetItem <= batch->firstItem &&
> > > + (posItem == batch->lastItem ||
> > > + (hbatch->visInfo[batch->lastItem] & HEAP_BATCH_VIS_CHECKED));
> > > + }
> >
> > Might be worth putting into a static inline.
>
> Why do you say that? I don't follow.
It makes sense to have separate forward/backward loops for runtime efficiency,
but it seems just about enough code that it'd be nice to put it into an
always-inline helper function that's called with a constant is_forward=0/1
argument. Or perhaps just have the loop body in a helper.
But it's a purely local concern, we can do that at any later point, so it
really doesn't matter too much right now.
> > Can't the *2 lead to computing a xs_vm_items bigger than scan->maxitemsbatch?
>
> Yes, it can.
But we prevent it from overflowing with a Max()/Min() in the loop headers.
> > > +tableam_util_batchscan_end(IndexScanDesc scan)
> > > +{
>
> > > + pfree(index_scan_batch_base(scan, cached));
> >
> > I'd set scan->batchcache[i] to NULL after freeing.
>
> I tried that out, but I *think* it regressed pgbench SELECT by about
> ~1.5% of total TPS (I'd need more thorough testing to confirm this).
> Are you sure we need this?
That seems surprising. But no, I don't think we need this.
> > > + if (batch->numDead > 0 &&
> > > + scan->indexRelation->rd_indam->amkillitemsbatch != NULL)
> > > + {
> > > + if (batch->numDead > 1)
> > > + {
> > > + qsort(batch->deadItems, batch->numDead, sizeof(int),
> > > + batch_compare_int);
> > > + batch->numDead = qunique(batch->deadItems, batch->numDead,
> > > + sizeof(int), batch_compare_int);
> > > + }
> > > +
> > > + scan->indexRelation->rd_indam->amkillitemsbatch(scan, batch);
> > > + }
> >
> > Based on the name or function header one sure wouldn't think this will call
> > indexam callbacks...
>
> I dunno, how do you think it should work?
It's perhaps ok, I just was a bit surprised to see the callback inside a
function with that name. I'd perhaps just name it
tableam_util_release_batch() (and rename batch_free accordingly).
> To me it makes sense to call amkillitemsbatch here, because when the
> table AM frees a batch it clearly has no further need to perform heap
> fetches for any items in that batch. The table-AM-focussed function
> name seems fine to me because this is one extra step, that happens to
> involve calling the index AM.
Yea, I don't have a problem with it being named tableam_*.
> > > - scan->kill_prior_tuple = false; /* for safety */
> > > scan->xs_heap_continue = false;
> > >
> > > - scan->indexRelation->rd_indam->amrestrpos(scan);
> > > + table_index_fetch_restrpos(scan);
> > > }
> >
> > Did we loose that kill_prior_tuple? Or is there a reason it's not necessary
> > anymore? Too tired to think through this.
>
> We still need kill_prior_tuple, but only for amgettuple scans. Those
> don't support mark + restore anymore.
I'd probably just assert !kill_prior_tuple.
Greetings,
Andres Freund