Quick Links

Re: index prefetching

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Tomas Vondra <tomas(at)vondra(dot)me>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject:	Re: index prefetching
Date:	2025-12-01 01:23:56
Message-ID:	CAH2-WzmYqhacBH161peAWb5eF=Ja7CFAQ+0jSEMq=qnfLVTOOg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Nov 10, 2025 at 6:59 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> The new tentative plan is to cut scope by focussing on switching over
> to the new index AM + table AM interface from the patch in the short
> term, for Postgres 19.

Attached patch makes the table AM revisions we talked about. This is a
significant change in direction, so I'm adopting a new patch
versioning scheme: this new version is v1. (I just find it easier to
deal with sequential patch version numbers.)

I'm sure that I'll have made numerous mistakes in this new v1. There
will certainly be some bugs, and some of the exact details of how I'm
doing the layering are likely suboptimal or even wrong. I am
nevertheless cautiously optimistic that this will be the last major
redesign that will be required for this project.

> There is an almost immediate benefit to just
> doing that much, unrelated to I/O prefetching for index scans: it
> enables batching of heap page buffer locking/unlocking (during the
> point where index scans perform heap_hot_search_buffer calls) on the
> table AM/heapam side during ordered index scans.

Note that this new v1 doesn't yet include the important heapam buffer
locking optimization discussed here. It didn't seem worth holding up
everything just for that. Plan is to get to it next.

(It isn't intrinsically all that complicated to add the optimization
with this new table AM orientated structure, but doing so would have
made performance validation work/avoiding regressions with simple
queries that much harder. So I just put it off for a bit longer.)

What's new in v1 (compared to v20251109-*, the prior version):

* The first patch in the series is now mostly about changing the table
AM and index AM in a complementary way (not just about adding the
amgetbatch interface to the index AM).

To summarize this point (mostly just a recap of recent discussion on
the table AM API on this thread) with its own sub points:

- We're now using a slot-based table AM interface that understands
scan direction. We now do all VM access for index-only scans on the
heapam side, fixing that existing table AM modularity violation once
and for all.

- Batches returned by amgetbatch are directly managed by heapam,
giving it the direct control that it requires to get the best possible
performance. Whether that's for adding I/O prefetching, or for other
optimizations.

- The old table_index_fetch_tuple index scan interface is still needed
-- though only barely.

The rule going forward for core executor code is that it should always
use this new slot-based interface, unless there is a specific need for
such a caller to pass *their own* TID, in a way that cannot possibly
be delegated to our new high level table AM interface.

For example, we still need table_index_fetch_tuple for nbtree's
_bt_check_unique; it must pass TIDs to heapam, and get back tuples,
without starting any new index scan to do so (the only "index scan"
involved in the case of the _bt_check_unique caller takes place in the
btinsert that needs to perform unique index enforcement in passing). I
think it makes perfect sense that a small handful of special case
callers still need to use table_index_fetch_tuple, since there really
is no way around the need for these callers to pass their own TID.

* Major restructuring of batch management code, to allow it to work
with the table AM interface (as well as related improvements and
polishing).

The parts of batch management that aren't under the direct control of
table AMs/heapam (the batch helper functions that all table AMs will
use) are no longer in indexam.c; there's a new file for those routines
named indexbatch.c. indexbatch.c is also the place where a few other
helper functions go. These other functions are called by indexam.c/the
core executor, for things like initializing an amgetbatch scan, and
informing nbtree that it is taking a mark (for mark/restore).

Maybe there are certain remaining problems with the way that indexam.c
and heapam_handler.c are coordinating across index scans. Hopefully
the structure wasn't accidentally overfitted to heapam/isn't brittle
in some other way.

* Renamed and made lots of tweaks to batching related functions and
structs. I've also integrated code that previously appeared in its own
"batch cache" patch into the new main commit in the patch series (the
first patch in the new series).

The main goal of the tweaks to the data structures was to avoid
indirection that previously caused small regressions in my
microbenchmarks. We're very sensitive to costs from additional pointer
chasing in these code paths. And from even small memory allocations.

I think that I've avoided all regressions with just the first patch,
at least for my own microbenchmark suite. I did not aim to avoid these
regressions with the prefetching patch, since I consider it out of
scope now (for Postgres 19).

* v1 breaks out prefetching into its own patch, which is now the
second patch in the patch series.

The new I/O prefetching patch turned out to be surprisingly small. I
still feel good about our choice to put that off until Postgres 20,
though -- it's definitely where most of the difficulties are.
Especially with things like resource management. (The problem with the
second patch is that it's too small/doesn't address all the problems,
not that it's too big and unwieldy.)

Prefetching works at least as well as it did in earlier versions
(maybe even slightly better). It's not just an afterthought here. At a
minimum, we need to continue to maintain prefetching in a reasonably
complete and usable form to keep us honest about the design changes in
the table AM and index AM APIs. If the design itself cannot eventually
accommodate Postgres 20 work on I/O prefetching (and even later work),
then it's no good.

Minor caveat about preserving prefetching in good working order: I
disabled support for index-only scans that use I/O prefetching for
heap accesses in the second patch, at least for now. To recap, IoS
support requires a visibility cache so that both readBatch and
streamBatch agree on exactly which heap blocks will need to be read,
even when the visibility map has some relevant heap page bits
concurrently set or unset. It won't be too hard to add something like
that back to heapam_handler.c, but I didn't get around to doing so
just yet.

It might be independently useful to have some kind of visibility
cache, even without prefetching; batching VM accesses (say by doing
them up front, for a whole batch, right after amgetbatch returns)
might work out saving cycles with cached scans. You know, somewhat
like how we'll do same-heap-page heap tuple fetches eagerly as a way
of minimizing buffer lock/unlock traffic.

* There's a new patch that adds amgetbatch support for hash indexes.

This demonstrates that the amgetbatch interface is already reasonably
general. And that adding support to an index AM doesn't have to be all
that invasive. I'm more focussed than ever on the generality of the
API now.

* Added documentation that attempts to formalize the constraints that
index AMs that opt to use amgetbatch are under.

I don't think that it makes sense to think of amgettuple as the legacy
interface for plain index scans. There will probably always be cases
like KNN GiST scans, that legitimately need the index AM to directly
control the progress of index scans, a tuple at a time.

After all, these scan types give an absurd amount of control over many
things to the index AM -- that seems to really make it hard to put the
table AM in control of the scan's progress. For example, GiST scans
use their own GISTSearchHeapItem struct to manage each item returned
to the scan (which has a bunch of extra fields compared to our new
AM-generic BatchMatchingItem struct). This GISTSearchHeapItem struct
allows GiST to indicate whether or not an index tuple's quals must be
rechecked. It works at the tuple granularity (individual GiST
opclasses might expect that level of flexibility), which really works
against the batching concepts that we're pursuing here.

It's true that hash index scans are also lossy, but that's quite
different: they're inherently lossy. It's not as if hash index scans
are sometimes not lossy. They certainly cannot be lossy for some
tuples but not other tuples that all get returned during the same
index scan. Not so with GiST scans.

Likely the best solution to the problems posed by GiST and SP-GiST
will be to choose one of either amgettuple and amgetbatch during
planning, according to what the scan actually requires (while having
support for both interfaces in both index AMs). I'm still not sure
what that should look like, though -- how does the planner know which
interface to use, in a world where it has to make a choice with those
index AMs that offer both? Obviously the answer depends in part on
what actually matters to GiST/where GiST *can* reasonably use
amgetbatch, to get benefits such as prefetching. And I don't claim to
have a full understanding of that right now.

Here are the things that I'd like to ask from reviewers, and from Tomas:

* Review of the table AM changes, with a particular emphasis on high
level architectural choices.

* Most importantly: will the approach in this new v1 avoid painting
ourselves into a corner? It can be incomplete, as long as it doesn't
block progress on things we're likely to want to do in the next couple
of releases.

* Help with putting the contract that amgetbatch requires of index AMs
on a more rigorous footing. In other words, is amgetbatch itself
sufficiently general to accomodate the needs of index AMs in the
future? I've made a start on that here (by adding sgml docs about the
index AM API, which mentions table AM concerns), but work remains,
particularly when it comes to supporting GiST + SP-GiST.

I think it makes sense to keep feedback mostly high level for now --
to make it primarily about how the individual API changes fit
together, if they're coordinating too much (or not enough), and if the
interface we have is able to accommodate future needs.

--
Peter Geoghegan

Attachment	Content-Type	Size
v1-0004-Add-amgetbatch-support-to-hash-index-AM.patch	application/octet-stream	36.2 KB
v1-0001-Add-batching-interfaces-used-by-heapam-and-nbtree.patch	application/octet-stream	197.8 KB
v1-0003-bufmgr-aio-Prototype-for-not-waiting-for-already-.patch	application/octet-stream	6.9 KB
v1-0002-Add-prefetching-to-index-scans-using-batch-interf.patch	application/octet-stream	32.1 KB

In response to

Re: index prefetching at 2025-11-10 23:59:07 from Peter Geoghegan

Responses

Re: index prefetching at 2025-12-01 16:32:30 from Tomas Vondra
Re: index prefetching at 2025-12-04 05:54:09 from Amit Langote
Re: index prefetching at 2025-12-08 20:50:56 from Peter Geoghegan

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Ashutosh Bapat	2025-12-01 03:33:05	Re: meson and check-tests
Previous Message	Hayato Kuroda (Fujitsu)	2025-12-01 00:58:33	RE: [buildfarm related] Machines gcc experimental failed test_lfind