Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2026-01-20 20:41:52
Message-ID: CAH2-Wzm2ADrCAMKLVquJSZ=4SbHZpRGnSCbpoiwuGvhms3w4iA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 19, 2026 at 2:45 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> On Mon, 19 Jan 2026 at 00:52, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > Dropping leaf page buffer pins during index-only scans
> > ------------------------------------------------------
> >
> > I realized (following a point made by Matthias van de Meent during a
> > private call) that it simply isn't necessary for index scans to hold
> > on to leaf page buffer pins, in general.

> Correct: My patch does the VM lookups that are currently being done by
> IndexOnlyNext just before releasing the index page, and eventually
> passes on that VM lookup result to IndexOnlyNext; allowing the scan to
> drop the pin on the index page without opening up to the mentioned
> race condition with vacuum.
>
> One major difference between my patch and this patch, is that in my
> patch, the index AM itself asks the tableAM to do visibility checks on
> a set of TIDs, at a point where the index AM would like to drop the
> pin.
>
> With your v8-0003, the index doesn't have access to such an API
> (instead, heap_batch_getnext does VM checks), which makes it more
> difficult for index AMs to implement the batch API when it can't
> return tuples one page at a time (like GiST, with ordered scans). If
> heap_batch_resolve_visibility (or my patch's
> table_index_vischeck_tuples() API, see v14-0003) were exposed for
> indexes to apply, and if the batch API allowed the index to drop the
> pin instead of requiring the tableAM to do that, then that'd be a huge
> enabler for an implementation of amgetbatch() for GiST/SP-GiST.

> > The race in question involves VACUUM concurrently setting a heap page
> > all-visible in the VM while a TID pointing into that page is also
> > recycled by VACUUM (as it sets the page all-visible).

> Correct. One thing to note here is that this doesn't mean you can take
> the VM results you got whilst reading one index page and use them for
> the tuples of another index page; it'd likely cause issues if this
> isn't tracked correctly.

Right. In the index-only scan scenarios we need to worry about here,
we really do need to hold the buffer pin while accessing the VM for
all of the items from that page. And we cannot assume anything about
the state of those same heap blocks at some later time.

The dead TIDs from an index page that point to that same heap page
(the ones that VACUUM concurrently removes) must be dead to everybody
in these scenarios. And, the heap page that is concurrently set
all-visible by the same VACUUM really is all-visible to everybody (at
least by the time VACUUM reaches the point in lazy_vacuum_heap_page
where it sets the page all-visible, having also set some LP_DEAD heap
stubs as LP_UNUSED for the now-removed TIDs from the index). Nobody
disagrees about the liveness or deadness of any relevant tuple here --
that was never the problem. The GiST index-only scan race (the one
that I also need to avoid in this patch) is fundamentally a failure to
account for an inconsistency between two physical data structures: an
index page and the visibility map.

As you know, we already drop the pin right away during nbtree plain
index scans, thanks to nbtree's dropPin optimization (when the scan
uses an MVCC snapshot). This is safe for the same reason that it's
safe during bitmap index scans: we can safely perform heap fetches
that land on some completely unrelated, newly inserted heap tuple,
because our MVCC snapshot will prevent that tuple from ever being
returned by the scan in any case. In fact, we can even do heap
fetches of LP_UNUSED heap items lacking tuple storage, which are
simply treated as tuples that are not visible to the scan.
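
To make that second point concrete, here's a minimal sketch of the
kind of check that makes heap fetches of LP_UNUSED items harmless
during a plain index scan (modeled loosely on what
heap_hot_search_buffer does today -- the function name and signature
here are mine, not code from the patch):

#include "postgres.h"
#include "access/heapam.h"
#include "access/htup_details.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Sketch only: report whether the heap tuple at "tid" is visible to
 * our MVCC snapshot, given a pinned and locked heap buffer.  HOT
 * chain following and redirect handling are omitted for brevity.
 */
static bool
index_fetch_tuple_visible(Relation heaprel, Buffer buf, ItemPointer tid,
						  Snapshot snapshot)
{
	Page		page = BufferGetPage(buf);
	OffsetNumber offnum = ItemPointerGetOffsetNumber(tid);
	ItemId		lp = PageGetItemId(page, offnum);
	HeapTupleData tuple;

	/*
	 * A line pointer that VACUUM already set LP_UNUSED (or LP_DEAD)
	 * has no tuple storage.  Treating it exactly like a tuple that
	 * isn't visible to the snapshot is safe: the scan just doesn't
	 * return anything for this TID, same as if the tuple had been
	 * deleted.
	 */
	if (!ItemIdIsNormal(lp))
		return false;

	tuple.t_self = *tid;
	tuple.t_tableOid = RelationGetRelid(heaprel);
	tuple.t_data = (HeapTupleHeader) PageGetItem(page, lp);
	tuple.t_len = ItemIdGetLength(lp);

	return HeapTupleSatisfiesVisibility(&tuple, snapshot, buf);
}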

On the other hand, during index-only scans it would be wrong to drop
the leaf page pin before we do any VM lookups/load a cache of
visibility info for the items on that page. VACUUM will routinely set
a heap page that contains LP_UNUSED items all-visible. Unlike in the
plain index scan case, there's no way for the scan to reliably know
that a TID from the index points to an LP_UNUSED item on an
all-visible heap page, and so shouldn't be treated as visible. The
"treat LP_UNUSED as dead" trick works, but (obviously) requires a
heap fetch -- the very thing an index-only scan skips whenever the VM
says the page is all-visible.

Delaying dropping the leaf page pin for long enough to cache
visibility info from the VM for the page's items gives us "just
enough of an interlock" against unsafe concurrent TID recycling by
VACUUM. When the index-only scan has to do heap fetches, they'll be
correct for the same reason that they're already correct with plain
index scans (and bitmap index scans). When it does not, we know that
we're working off of visibility info that is consistent with the
index page that the TIDs came from.
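
To spell out the ordering I mean, here's a sketch (the batch struct
and function name are mine, not the patch's):

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"

/* Illustrative batch layout -- not the actual struct from the patch */
typedef struct BatchItem
{
	ItemPointerData tid;	/* TID taken from the leaf page */
	bool		all_visible;	/* cached VM status for tid's heap page */
} BatchItem;

/*
 * Capture VM status for every TID read from the current leaf page
 * while that page's buffer pin is still held, and only then release
 * the pin.  The pin blocks VACUUM's cleanup lock on the leaf page, so
 * VACUUM cannot finish removing any of these TIDs (and then mark
 * their heap pages all-visible) before we've captured VM bits that
 * are consistent with the index page the TIDs came from.
 */
static void
cache_vm_status_then_unpin(Relation heaprel, Buffer leafbuf,
						   BatchItem *items, int nitems,
						   Buffer *vmbuffer)
{
	for (int i = 0; i < nitems; i++)
	{
		BlockNumber blkno = ItemPointerGetBlockNumber(&items[i].tid);

		items[i].all_visible = VM_ALL_VISIBLE(heaprel, blkno, vmbuffer);
	}

	/* Only now is it safe to give up the leaf page pin */
	ReleaseBuffer(leafbuf);
}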

> Btree scans without lock coupling have been in the works for a while
> now, this change would add some final icing to the cake.

I think that it's good that this makes the rules for index-only scans
as close as possible to those for plain index scans. And that it makes
the rules simple.

> > Dropping leaf page buffer pins during scans of an unlogged relation
> > -------------------------------------------------------------------

> > We solve that problem by introducing GiST style "fake LSNs" to both
> > nbtree and hash. Now the same LSN trick works for unlogged relations,
> > too.
>
> It looks like this 'fake LSN' in nbtree is updated every time the page
> is modified (including on every insert), instead of just every time
> tuples are removed.
>
> Strictly speaking, only the latter case would need (fake) LSN updates
> in unlogged relations where we're marking tuples dead; there is no TID
> that has been removed and re-inserted with different values; and it'd
> be a fairly cheap micro-optimization that's useful for indexes with
> high page-level churn.

I'm fairly sure that you could make almost the same argument against
the way that fake LSNs work in GiST right now. The NSN thing likely
doesn't require that every single atomic operation get a new fake LSN
within an unlogged index. And yet that's how GiST does it.

> However, nbtree (and presumably hash too) only uses page LSNs to check
> whether it can safely set dead bits, and for that to work correctly we
> only need to invalidate the LSN associated with the cached TID lists
> when a TID is removed from the page; either through vacuum or through
> page splits or other tuple movement.

I agree that it would probably be correct to only use a fake LSN
within nbtree VACUUM (and hash VACUUM). However, I don't think it's a
good idea to be selective about the use of fake LSNs. I think that
fake LSNs should seamlessly behave like real LSNs, from the point of
view of any code (actual or hypothetical) that wants to use LSNs to
reason about concurrent changes.
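
Concretely, the shape I have in mind looks something like this
(simplified sketch, not the patch itself; the helper name is mine).
An unlogged nbtree page gets stamped with a fake LSN at every point
where a WAL-logged page would get the LSN of its WAL record:

#include "postgres.h"
#include "access/xlog.h"		/* GetFakeLSNForUnloggedRel() */
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Stamp an unlogged index page with a fake LSN, at the same point in
 * the code where a WAL-logged page would be stamped with its WAL
 * record's LSN.  The caller has already modified the page and marked
 * the buffer dirty.
 */
static void
stamp_unlogged_page_lsn(Relation rel, Buffer buf)
{
	Assert(!RelationNeedsWAL(rel));

	PageSetLSN(BufferGetPage(buf), GetFakeLSNForUnloggedRel());
}

With that in place the reader side needs no special case at all: code
that remembered the page's LSN while it held the pin (so->currPos.lsn)
can keep comparing it against BufferGetLSNAtomic() when it comes back
to set LP_DEAD bits, exactly as _bt_killitems already does for
WAL-logged indexes.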

For one thing, we might actually want to do something within nbtree
in the future that's a bit like what GiST already does: store LSNs on
a descent stack, as a way of determining whether pages have changed
since we last saw them. That could help with SAOP scans that cache
binary search state on internal pages, and reuse it across index
searches/descents. More importantly, selectively implementing fake
LSNs (by only doing the bare minimum needed by _bt_killitems) seems
likely to make testing and validation harder.

Besides all this, the amount of code added by the nbtree fake LSN
patch is under 100 lines.

--
Peter Geoghegan
