Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2026-01-18 23:51:58
Message-ID: CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 13, 2026 at 3:36 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> The batch stopped applying again. Attached is v7.

Attached is v8. A lot has changed in just the past week. We're feeling
optimistic about getting I/O prefetching into Postgres 19, and so
we've decided to fully commit ourselves to that.

All of the remaining kludges that we've relied on up until now have
either been removed, or replaced with a principled approach. Notable
changes in this v8 compared to v7:

* We no longer "reset" the read stream to deal with exhaustion of the
available slots for batches. We now rely on a patch of Thomas Munro's
[1] to pause the read stream from within the new heapam callback, in
the event of running out of batch slots.

When this happens, we pause only until the scan has returned enough
items to force scanPos to advance to the next batch in the ring buffer
(obviously, that next batch must already be loaded in this scenario).
At that point we resume the read stream, which will once again be able
to call amgetbatch to store another batch in the just-freed slot.

Running out of batch slots like this happens rarely, so it's important
to have a reasonably simple and elegant solution, which this seems to
be. Importantly, there's no "ping-ponging" behavior here.
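To make the slot accounting concrete, here's a toy model of the pause/resume
interaction with the batch ring buffer. All names and the slot count are
invented for illustration; none of this is code from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented slot count, standing in for the real ring buffer's capacity */
#define MAX_BATCHES 4

typedef struct BatchRing
{
    int  headBatch;     /* oldest batch that scanPos still needs */
    int  nextBatch;     /* next slot the read stream callback would fill */
    bool paused;        /* read stream paused for want of a free slot */
} BatchRing;

/* Read stream callback side: try to claim a slot for another batch. */
static bool
ring_try_load_batch(BatchRing *ring)
{
    if (ring->nextBatch - ring->headBatch >= MAX_BATCHES)
    {
        /* No free slot: pause the stream rather than resetting it */
        ring->paused = true;
        return false;
    }
    ring->nextBatch++;
    return true;
}

/* Scan side: advancing scanPos past the oldest batch frees its slot. */
static void
ring_advance_scanpos(BatchRing *ring)
{
    ring->headBatch++;
    /* A slot just opened up, so the read stream can resume */
    ring->paused = false;
}
```

The point of the model is that pausing is transient: the moment scanPos
advances past the oldest batch, a slot opens up and the stream resumes,
with no ping-ponging in between.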

* Removed the grotty heuristics in our read stream callback that existed to avoid regressions.

These first appeared many months ago, and no longer appear to be
necessary. There are still some regressions with pathological cases,
but they seem to be well within acceptable bounds. I saw only about a
10% remaining increase in query execution time for an adversarially
crafted query that Tomas came up with back in August [2]. That query
provided the original justification for inventing those heuristics.

* We now have a real strategy around resource management and buffer pins.

A concern long held by Andres (and others) was that holding on to many
index leaf page buffer pins would somehow conflict with the read
stream's careful management of heap page buffer pins [3]. We were
particularly concerned about hard-to-hit cases where the read stream
is somehow limited by recently acquired buffer pins for index pages.
This was probably the thing that made me most doubtful about our
ability to get the prefetching patch into shape for Postgres 19. But
that's completely changed in just the past week.

Dropping leaf page buffer pins during index-only scans
------------------------------------------------------

I realized (following a point made by Matthias van de Meent during a
private call) that it simply isn't necessary for index scans to hold
on to index page buffer pins, in general. We haven't actually done that with
nbtree plain index scans with MVCC snapshots since 2015's commit
2ed5b87f, which added what we now call nbtree's dropPin optimization.
What if we could find a way to teach *every* amgetbatch-supporting
index AM to do the same trick for *all* scans (barring non-MVCC
snapshot scans)? Including index-only scans and scans of unlogged
relations? Then we'd have zero chance of unintended interactions with
the read stream; there'd simply be no extra buffer pins that might
confuse the read stream in the first place!

Matthias told me that his patch to fix the bugs in GiST index-only
scans works by not dropping a pin on a GiST leaf page right away. It
delays dropping such pins, but only very slightly: if we cache
visibility information from the VM (which we're doing already in the
amgetbatch patch, and which Matthias' patch does too), and delay
dropping a batch's leaf page pin until after its VM cache is loaded,
it reliably avoids races of the kind that we need to be worried about
here. In short, we can eagerly drop buffer pins during index-only
scans in the same way (or virtually the same way) that we've long been
able to with nbtree plain index scans thanks to nbtree's dropPin
optimization.

The race in question involves VACUUM concurrently setting a heap page
all-visible in the VM while recycling a TID that points to that same
page. We can safely allow VACUUM to go ahead with this while still
dropping our pin early -- provided we build our local cache of
visibility information first. Holding on to the leaf page pin while
reading from the VM suffices. The important principle is
that our local cache of VM info is (and will remain) consistent with
what we saw on the index page when we saved its matching TIDs into a
batch. (It doesn't matter that we do heap fetches for now-all-visible
pages, because any tuples at recycled TIDs cannot possibly be visible
to the scan's MVCC snapshot. Just like in the plain index scan dropPin
case. And rather like bitmap index scans.)
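Roughly speaking, the whole trick is an ordering rule: build the batch's
local VM cache while the leaf page pin is still held, and only drop the pin
afterwards. Here's an illustrative standalone sketch (invented names and a
stand-in for the VM lookup; not code from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct Batch
{
    bool     pinHeld;       /* leaf page buffer pin still held? */
    bool     vmCacheValid;
    uint8_t  vmcache[8];    /* one all-visible flag per saved TID */
} Batch;

/* Stand-in for a real visibility map lookup (VM_ALL_VISIBLE()) */
static uint8_t
fake_vm_lookup(int tid)
{
    return (tid % 2 == 0);
}

static void
batch_finish_load(Batch *batch, int ntids)
{
    /*
     * Step 1: while the pin is held, VACUUM can't recycle the TIDs we
     * saved, so the VM bits we cache here stay consistent with what we
     * saw on the leaf page.
     */
    assert(batch->pinHeld);
    for (int i = 0; i < ntids; i++)
        batch->vmcache[i] = fake_vm_lookup(i);
    batch->vmCacheValid = true;

    /*
     * Step 2: only now is it safe to drop the leaf page pin.  VACUUM may
     * subsequently recycle TIDs and set pages all-visible, but our cached
     * bits were taken before that could happen.
     */
    batch->pinHeld = false;
}
```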

v8-0003-Add-batching-interfaces-used-by-heapam-and-nbtree.patch has a
new isolation test that demonstrates the new "drop pins eagerly during
index-only scans" behavior, which is named
index-only-scan-visibility.spec. The isolation test is a variant of
the one I posted on the GiST thread, which proved that GiST is broken
here (a problem that Matthias is working on fixing). If you attempt to run this
isolation test on master, it'll block forever; VACUUM can never acquire a
cleanup lock due to a conflicting buffer pin held by an index-only scan.
That doesn't happen with v8 of the patch, though; it completes in less than
20ms on my system (and the scan actually gives correct results!).

This still leaves non-MVCC snapshot scans. There's nothing we can do
to avoid holding on to a leaf page buffer pin while accessing the heap
there. But that's okay; now we just refuse to do I/O prefetching
during such scans.

Dropping leaf page buffer pins during scans of an unlogged relation
-------------------------------------------------------------------

Another thing that hinders nbtree's dropPin optimization (and that we
must deal with to get a guarantee that leaf page buffer pins never
really need to be kept around) is the use of an unlogged relation.
That breaks dropPin's approach to detecting unsafe concurrent TID
recycling when marking dead items LP_DEAD on index pages, since that
involves stashing a page's LSN and later checking whether it has
changed -- unlogged relations don't write WAL, so their page LSNs
never advance.

We solve that problem by introducing GiST-style "fake LSNs" to both
nbtree and hash. Now the same LSN trick works for unlogged relations,
too.
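In outline, the LSN trick (and the fake-LSN variant of it; GiST gets its
fake LSNs from a shared counter via GetFakeLSNForUnloggedRel()) works like
this. A standalone sketch with invented names, not code from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * For unlogged relations there is no real WAL LSN, so a shared counter
 * hands out monotonically increasing "fake LSNs" instead (the real code
 * uses an atomic fetch-add on shared memory).
 */
static XLogRecPtr fakeLSNcounter = 1000;

static XLogRecPtr
get_fake_lsn(void)
{
    return ++fakeLSNcounter;
}

typedef struct LeafPage
{
    XLogRecPtr lsn;         /* bumped on any modification of the page */
} LeafPage;

/* Any change to the page (e.g. VACUUM recycling TIDs) advances its LSN */
static void
page_modify(LeafPage *page)
{
    page->lsn = get_fake_lsn();
}

/*
 * dropPin-style check: having read the page without keeping a pin, it is
 * only safe to go back and mark items LP_DEAD if the page's LSN is
 * unchanged since we stashed it -- otherwise TIDs may have been recycled.
 */
static bool
safe_to_kill_items(const LeafPage *page, XLogRecPtr stashedLSN)
{
    return page->lsn == stashedLSN;
}
```

With real LSNs on a logged relation the comparison is the same; fake LSNs
just make it work where no WAL is written.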

We now require that any other amgetbatch index AMs added in the future
also use fake LSNs like this. Alternatively, such an index AM could
simply not provide a _bt_killitems-like mechanism at all -- that also
works. Or it could limit the use of such a mechanism to logged
relations. Third-party table AMs don't really need to deal with this
themselves, though.

Performance impact of calling BufferGetLSNAtomic during affected scans
----------------------------------------------------------------------

Our expanded use of BufferGetLSNAtomic() during index-only scans has
the potential to cause regressions, at least when page checksums are
enabled. We're planning on relying on a patch of Andreas Karlsson's
that makes BufferGetLSNAtomic use an atomic op [4], which fixes this
regression. I'm not including that patch here, though (I would, but
for the fact that it breaks the Debian Trixie CI target due to a known
misaligned-access bug that Andreas is working on fixing). Anybody who
does performance validation of either index-only scans or scans of an
unlogged relation should bear that in mind.
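To illustrate why the atomic-op approach helps: today, reading a shared
page LSN safely in the checksums-enabled case means taking the buffer
header spinlock around the read, whereas the referenced patch reduces it
to a single lock-free 64-bit load. A standalone C11 sketch (invented
struct layout, nothing like the real buffer descriptor):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct BufferDesc
{
    _Atomic uint64_t pageLSN;   /* illustrative; not the real layout */
} BufferDesc;

/* One lock-free load replaces the lock/read/unlock sequence */
static uint64_t
buffer_get_lsn_atomic(BufferDesc *buf)
{
    return atomic_load(&buf->pageLSN);
}

static void
buffer_set_lsn(BufferDesc *buf, uint64_t lsn)
{
    atomic_store(&buf->pageLSN, lsn);
}
```

On hot index-only scan paths that's the difference between a contended
spinlock acquisition per LSN read and an ordinary atomic load.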

> > I still haven't had time to produce an implementation of the "heap
> > buffer locking minimization" optimization that's clean enough to
> > present to the list.
>
> Still haven't done this.

This idea has now been deprioritized. Quite a few things have fallen
together recently, so we're "pivoting back" to work on prefetching for
Postgres 19.

[1] https://postgr.es/m/CA%2BhUKGJLT2JvWLEiBXMbkSSc5so_Y7%3DN%2BS2ce7npjLw8QL3d5w%40mail.gmail.com
[2] https://postgr.es/m/9af33041-1c16-4973-855a-718aa1048ee1@vondra.me
[3] https://postgr.es/m/mc5w6mj52dzl7ant7nmjmwxjmvmlwekwjmf77eotrra3pghrfl@d7mq3hxvdapa
[4] https://postgr.es/m/b6610c3b-3f59-465a-bdbb-8e9259f0abc4@proxel.se

--
Peter Geoghegan

Attachment Content-Type Size
v8-0001-Extract-fake-LSN-infrastructure-from-GiST-index-A.patch application/x-patch 16.8 KB
v8-0002-Add-fake-LSN-support-to-nbtree.patch application/x-patch 11.9 KB
v8-0003-Add-batching-interfaces-used-by-heapam-and-nbtree.patch application/x-patch 189.5 KB
v8-0004-Introduce-read_stream_-pause-resume-yield.patch application/x-patch 11.8 KB
v8-0005-Add-prefetching-to-index-scans-using-batch-interf.patch application/x-patch 52.8 KB
v8-0006-bufmgr-aio-Prototype-for-not-waiting-for-already-.patch application/x-patch 6.9 KB
v8-0007-Add-fake-LSN-support-to-hash-index-AM.patch application/x-patch 13.7 KB
v8-0008-Make-hash-index-AM-use-amgetbatch-interface.patch application/x-patch 36.4 KB
