Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2026-06-07 00:52:27
Message-ID: CAH2-WzmAKApAYttMQfhh-usweXvRYpHcku-OKHYgDvb=wLSWHg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Apr 5, 2026 at 7:01 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Why would it require creating a tableam shaped slot internally? It seems
> rather silly to store a tuple fetched to verify visibility into a slot, where
> it is never used. Storing in a tuple requires an indirect things like
> IncrBufferRefCount(), which isn't that cheap.

Attached is v26, which resolves many outstanding issues from when we
paused work on this, at the end of the final CF for Postgres 19.

I'll highlight each change with its own bullet-point. The stand-out
item is probably:

* A new patch now implements gistgetbatch, which makes GiST scans do
index prefetching.

Performance is very similar to nbtree, so query execution time
reductions of 20x or more are quite possible. heapam's prefetching
approach generalizes very well across all core index AMs that support
plain scans. This proves that the amgetbatch design is general enough
to completely replace amgettuple, without any real downsides for index
AMs.

I want to flag something up-front about the GiST patch: it fixes a
long standing bug in GiST index-only scans [1]. But there's a
catch/downside: the fix works (in part) by teaching the planner to
disallow index-only scans that use orderbyclauses, such as the <->
operator.

I'll discuss why this makes sense at the end of the email. Note that
the ordered scan issue isn't related to the amgetbatch design; it's
fundamentally a problem with GiST, which has long ignored the rules
about buffer pins explained in the docs "Index Locking Considerations"
section. Making it follow those rules seems impossible for ordered
scans specifically (but is quite doable for all other GiST scan
types).

Slot-related improvements
-------------------------

I made many improvements to the high-level slot-passing API functions.
I also improved certain lower-level details of how we use slots within
heapam_indexscan.c. To start:

* Index-only scans no longer need to use a buffered tuple table slot
at all. As such, IoS callers to table_index_getnext_slot are no longer
required to "lend heapam a buffered tuple slot", as in previous
versions.

Andres pointed out that the previous approach was silly, and it was.
Now index-only scans use a temp HeapTuple directly; calling
heap_hot_search_buffer doesn't truly require passing a slot (not now,
at least).

> I could see passing an ine shaped slot to table_index_getnext_ios(), which
> would improve the layering for nodeIndexonlyscan.c further. It's a bit gross
> that there's this stuff in IndexOnlyNext()
>
> /*
> * Fill the scan tuple slot with data from the index. This might be
> * provided in either HeapTuple or IndexTuple format. Conceivably an
> * index AM might fill both fields, in which case we prefer the heap
> * format, since it's probably a bit cheaper to fill a slot from.
> */
> if (scandesc->xs_hitup)
> {

Agreed that the layering improvements within nodeIndexonlyscan.c still
didn't go far enough in previous versions. Which brings me to my next
highlight/improvement:

* Index-only scans now use their own virtual slot to retrieve rows
through table_index_getnext_slot, in the obvious and idiomatic way
(just like plain scans, though using a different slot type). Places
like nodeIndexonlyscan.c no longer perform ad-hoc reading of
IndexScanDesc fields such as xs_itup, xs_hitup, and xs_recheck.

Implementing this item necessitated moving the "Fill the scan tuple
slot with data from the index" logic into heapam_indexscan.c, just as
you'd predicted. I think that that's basically fine. I'm not entirely
sure if the relevant function needs to be exported so other table AMs
can call it.

A non-obvious advantage of this approach is that it fixes a
theoretical latent bug in selfuncs.c: it will sometimes scans an
nbtree index on a name column, with cstring index storage, while
arguably expecting that there is a full NAMEDATALEN of storage for
each datum. This is not actually a bug in practice; we don't really
read past the end of batch memory, because selfuncs.c doesn't actually
do anything that runs that risk. Better to nail it down, though.

Note that I didn't invent a new table_index_getnext_ios shim function
(which Andres suggested during related discussion). The existing
table_index_getnext_slot already provided everything needed.
table_index_getnext_slot will now assert that its caller is using the
expected kind of slot (either TTS_IS_BUFFERTUPLE or TTS_IS_VIRTUAL)
according to whether the scan is !scan->xs_want_itup. I don't feel
very strongly about this point, though.

IndexScanDesc and IndexFetchTableData structs
---------------------------------------------

A few notable things to say here.

> > > > - struct IndexFetchTableData *(*index_fetch_begin) (Relation rel, uint32 flags);
> > > > + struct IndexFetchTableData *(*index_fetch_begin) (IndexScanDesc scan,
> > > > + uint32 flags);
> > >
> > > I kinda wonder if this should look roughly like this instead:
> > >
> > > /* size of the tableam's IndexFetchTableData */
> > > size_t index_fetch_state_size;
> > > void *(*index_fetch_begin) (Relation rel, IndexScanDesc scan, IndexFetchTableData *table_fetch, uint32 flags);
> > >
> > > Moving the allocation of the IndexFetchTableData to the caller. The advantage
> > > would be that this would allow us to later combine the allocation of the
> > > IndexScanDesc with the IndexFetchTableData, without again needing to change
> > > the tableam interface.

Maybe it makes sense to share one allocation for everything in the way
that you've outline here, but I haven't done that yet. That seems like
strictly an optimization to me. However:

* I completely removed the IndexFetchTableData struct, which seems
pointless given our extensive use of IndexScanDesc in
heapam_indexscan.c.

We do keep the underlying heapam struct (IndexFetchHeapData), which is
now accessed through a void pointer in IndexScanDesc -- exactly like
it's done with index AMs. There's no more "inheritance from
IndexFetchTableData", which seems to add nothing anymore.

There's also a tangentially related index AM/table AM API improvement:

* New IndexAmRoutine.amcanmarkpos field determines whether an index AM
can support mark/restore.

Only nbtree sets IndexAmRoutine.amcanmarkpos=true. We now require that
index AMs explicitly opt in to mark/restore support (like on master),
rather than implicitly making every index AM support mark/restore just
because they use amgetbatch. That was the theoretical contract in
previous patch versions, which was silly; while hash index scans can
arguably support mark/restore, allowing it makes no sense. And GiST
index scans certainly cannot support it. GiST fundamentally expects to
control the progress of the scan itself, even with the new
gistgetbatch support (which is why it still sets
IndexAmRoutine.amcanbackward=false and
IndexAmRoutine.amcanmarkpos=false along with it, and why it doesn't
even save sibling links in its opaque area).

Batch allocation, layout, opaque areas
--------------------------------------

I've improved how batches are laid out in memory. These improvements
affect both table AMs and index AMs in various ways. Improving the
layout of batches (and making related API enhancements) addressed
problems in many areas.

> I guess I was just a bit saddened by
> indexam_util_batch_alloc()
> ->table_index_fetch_batch_init()
> having to do
>
> scan->heapRelation->rd_tableam->index_fetch_batch_init(scan, batch, new_alloc);
>
> that's a lot of indirection to go through.

Most of the indirection is now avoided:

* We now avoid calling index_fetch_batch_init during scans whose table
AM didn't request any opaque area be made available in batches.

In practice this means that heapam plain index scans (and all bitmap
index scans) don't call index_fetch_batch_init at all. Now
index_fetch_batch_init is strictly about initializing the table AM
opaque area; scans without such an area in their batches will never
receive any calls to their index_fetch_batch_init callback.

* Layering improvement, related to the last item: We no longer
condition whether a table AM opaque area is included in each allocated
batch on whether the scan is index-only or not. We simply condition it
on whether the table AM asked us for an opaque area at the start of
the scan.

Obviously, other table AMs might have "table AM opaque area"
requirements that don't match heapam. The table AM can choose to use
an opaque area for its own reasons, whatever they are. It can also
control the precise layout of the opaque area. This area might include
a per-item portion, usually for visibility info, but it doesn't have
to; it can be a fixed-size struct or a simple array without any struct
header.

* We now make the table AM opaque area in each batch a single
contiguous area. heapam uses a flexible array member for this (when it
requests its own opaque area in the first place).

* heapam_index_fetch_batch_init no longer needs to memset() visibility
info as batches are allocated/recycled.

The HeapBatchData struct now contains the bounds of a range of
known-set items from the batch it is contained within (this
information is in its header struct, before the flexible array member
at the end). Visibility info for each batch item is stored in
HeapBatchData's flexible array member. When index_fetch_batch_init is
called, heapam_index_fetch_batch_init will just invalidate the
HeapBatchData bounds instead of calling memset.

Avoiding the memset might be a minor performance win (completely
skipping most index_fetch_batch_init calls certainly was). But FWIW I
actually did things this way because it seemed a bit clearer to me.
And because it proves that the new flexible array scheme actually
works.

gistgetbatch support and batch layout
-------------------------------------

As I mentioned, other changes to the batch layout specifically concern
index AMs. One AM in particular: GiST. I mentioned at the start of
this email that GiST scans are now supported. This improvement affects
the core indexbatch.h contract for index AMs, but actually appears in
the later GiST patch:

* Index AMs can now optionally request a second, dynamically-sized
opaque area (similar to those that table AMs request).

This area should not be confused with the standard
fixed-offset/fixed-size struct that the index AM must use in every
batch. That remains mandatory, and is still at a fixed offset, known
at compile time. Only GiST (and maybe SP-GiST) will find this second
area concept useful. It's a relatively niche facility.

In short, this new area gives those index AMs a way to store
tuple-level state required to implement all of the features that
weren't possible in earlier versions of the patch set. This other
opaque area gives the index AM "just enough tuple-level processing to
get by". gistgetbatch scans are fundamentally structured around
generating batches for each leaf page, just like btgetbatch and
hashgetbatch scans always were. However, a new amgettransform callback
is registered when the AM needs to perform tuple-level processing
later (at the point where the executor actually needs to return
items).

GiST's new amgettransform callback (gitgettransform)
----------------------------------------------------

This two-phase approach keeps the scan process simple. We store GiST's
on-disk format tuples in currTuples (just like nbtree), only
converting them into scan->xs_hitup format tuples when they are
actually consumed/returned. That way we can size each batch's
currTuples to BLCKSZ, enabling the same batch memory management
strategy used in other index AMs -- we can freely recycle batches
because their layout is fixed and predictable (for the duration of a
given scan).

Our approach generally simplifies resource management: GiST support
functions like leaking memory, but amgettransform can reset an
internal memory context every time another tuple is returned.

We primarily need amgettransform to handle rechecks in the core
executor, which might be required for any individual tuple, during any
GiST index scan. GiST opclasses have always been permitted to
implement a consistent function that can unpredictably cause
individual tuples to require (or not require) a recheck in the core
executor (unlike hash and nbtree, which always and never require a
recheck respectively). That "tuple needs recheck" information must
travel with the matching items in the same batch, so it can ultimately
be returned to executor node callers (table_index_getnext_slot
callers) at the tuple granularity.

Why GiST ordered index-only scans were disabled, "virtual" batches
------------------------------------------------------------------

As I mentioned in passing at the start of this email, GiST ordered
index-only scans are disabled in the planner by the GiST patch. This
disabling is necessary to fully fix the bugs in GiST index-only scans
[1]. Most GiST index-only scans now follow the exact same locking
protocol as nbtree index-only scans, which fixes the bug for those
scan types (note that GiST VACUUM now acquires cleanup locks that'll
conflict with pins held by scans). But ordered scans are different,
and require their own custom solution: disabling them altogether in
the planner.

The crucial difference between ordered GiST scans and "standard" GiST
scans is in how batches are formed. Ordered scans use "virtual
batches", which are groups of tuples that we *pretend* were returned
from a single leaf page that stored the tuples in the scan's required
sort order. In reality, the underlying tuples may originate from many
different leaf pages.

Holding many buffer pins as an interlock against concurrent TID
recycling (released by gistunguardbatch during index-only scans) just
isn't practical here, since we might have to hold an indeterminate
number of pins at once. It seems very difficult, perhaps impossible,
to limit the worst-case number of required pins. Why shouldn't a badly
implemented GiST opclass require an ordered index-only scan to pin
almost every leaf page in the index before GiST can return even one
index tuple to the scan? How can we nail down all that stuff in a way
that works in the general case?

This isn't a problem for prefetching's read stream -- the problem is
that the backend might run out of pins. I don't see a way around that
problem. Even if it's possible, the effort and testing required to get
it right seems unlikely to be worth it.

Another closely related consequence of using virtual batches is that
we cannot safely perform LP_DEAD-marking of index tuples within
gistkillitemsbatch. That restriction is nothing new, though; that has
never been supported. In general, using virtual batches (for ordered
scans) implies restrictions similar to those that have long affected
bitmap scans (no VM access due to the resulting TID interlock race
conditions [2], no setting of LP_DEAD bits).

Things I haven't done
---------------------

* There have been recent CI failures on "linux-autoconf", that I
haven't debugged.

001_aio.pl seems to reliably report "Dubious, test returned 4 (wstat
1024, 0x400)" on this CI target. I fully expect this v26 to fail when
CFTester runs it through CI.

I strongly suspect an issue in Andres' "Allow read_stream_reset() to
not wait for IO completion" patch is involved here. It could just be a
bug in the new 001_aio.pl tests added by that patch. It certainly
cannot be a bug in index prefetching itself, since it isn't exercised
by 001_aio.pl at all.

* I've not given much thought to improving the cost model of index
scans, to account for prefetching.

I believe the planner's cost functions give bitmap scans an undue
advantage over plain scans, at least with prefetching. It would
definitely make sense to start thinking about how to fix this problem,
but I haven't really started on that.

I think that we probably need to do something like stop charging
random_page_cost (and charge sequential page cost instead) once the
scan has processed a certain number of index pages and/or heap pages.
That is roughly how it's done already when costing bitmap heap scans.

* Continued to care about pg_attribute_always_inline making certain
function prototypes overlong and ugly.

Andres has a patch to shorten pg_attribute_always_inline to
pg_always_inline in the open CF, which will presumably be committed
soon. I don't want to forego prototypes; I am particular about the
order in which function definitions appear. For now, ugly and overlong
prototypes seem acceptable (I won't commit any patches with that
issue).

> Which is what then made me think of embedding something like a
> TableIndexFetchRoutine (with batch_init, reset, markpos, restrpos, end
> members) in IndexScanDesc. That way table_index_{fetch_reset,batch_Init,...}()
> wouldn't need this level of indirection.

* I didn't go this far.

> > Code reuse is a virtue, but it isn't the highest virtue. Splitting up
> > this function is possible but quite awkward.
>
> Not having the same code in multiple table AMs would make it easier to write
> them, which would be nice in itself, but I am more concerned with ending up
> with copies of heapam_index_getnext_scanbatch_pos() in various AMs, us finding
> problems with heapam_index_getnext_scanbatch_pos() in a minor release, and
> having a harder time adjusting things due to the copied code.

* I still haven't made progress on making
heapam_index_getnext_scanbatch_pos into a generic helper.

* I haven't added SP-GiST support.

I think that this can be done, but I'm not sure if it's worth the
trouble. SP-GiST VACUUM seems to do some weird things with buffer
pins, and I don't really understand how it deals with redirects.
Besides, we might be better off keeping SP-GiST as-is, so we have one
amgettuple index AM left for testing purposes.

The GiST patch should remove any doubt that the new amgetbatch API is
a suitable replacement for amgettuple; we should actually deprecate
amgettuple at some point. But I'm in no hurry.

[1] https://postgr.es/m/CAH2-Wz=jjiNL9FCh8C1L-GUH15f4WFTWub2x+_NucngcDDcHKw@mail.gmail.com
[2] See commit 459e7bf8, which showed that bitmap scans cannot safely
access the VM as an optimization (this is very similar to the extant
bugs in GiST and SP-GiST)
--
Peter Geoghegan

Attachment Content-Type Size
v26-0008-aio-Fix-pgaio_io_wait-for-staged-IOs-B.patch application/octet-stream 6.3 KB
v26-0001-Add-slot-based-table-AM-index-scan-interface.patch application/octet-stream 107.3 KB
v26-0006-heapam-Add-index-scan-I-O-prefetching.patch application/octet-stream 49.5 KB
v26-0007-Allow-read_stream_reset-to-not-wait-for-IO-compl.patch application/octet-stream 20.5 KB
v26-0009-WIP-aio-bufmgr-Fix-race-condition-leading-to-dea.patch application/octet-stream 3.1 KB
v26-0004-WIP-Adopt-amgetbatch-interface-in-GiST-index-AM.patch application/octet-stream 101.3 KB
v26-0003-Adopt-amgetbatch-interface-in-hash-index-AM.patch application/octet-stream 47.3 KB
v26-0005-heapam-Optimize-pin-transfers-during-index-scans.patch application/octet-stream 6.6 KB
v26-0002-Add-amgetbatch-interface-and-adopt-it-in-nbtree.patch application/octet-stream 255.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Chao Li 2026-06-07 01:43:34 Re: Fix tuple deformation with virtual generated NOT NULL columns
Previous Message Noah Misch 2026-06-07 00:02:18 Re: Non-text mode for pg_dumpall