| From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
|---|---|
| To: | Peter Geoghegan <pg(at)bowt(dot)ie> |
| Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Subject: | Re: index prefetching |
| Date: | 2026-01-19 19:45:40 |
| Message-ID: | CAEze2WiwCNu1jZmKa_gvTswwAegh3FaWu3gO_FiMgxCbEOBx8g@mail.gmail.com |
| Lists: | pgsql-hackers |
On Mon, 19 Jan 2026 at 00:52, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
[...]
>
> Dropping leaf page buffer pins during index-only scans
> ------------------------------------------------------
>
> I realized (following a point made by Matthias van de Meent during a
> private call) that it simply isn't necessary for index pages to hold
> on to buffer pins, in general. We haven't actually done that with
> nbtree plain index scans with MVCC snapshots since 2015's commit
> 2ed5b87f, which added what we now call nbtree's dropPin optimization.
> What if we could find a way to teach *every* amgetbatch-supporting
> index AM to do the same trick for *all* scans (barring non-MVCC
> snapshot scans)? Including index-only scans and scans of unlogged
> relations? Then we'd have zero chance of unintended interactions with
> the read stream; there'd simply be no extra buffer pins that might
> confuse the read stream in the first place!
>
> Matthias told me that his patch to fix the bugs in GiST index-only
> scans works by not dropping a pin on a GiST leaf page right away. It
> delays dropping such pins, but only very slightly: if we cache
> visibility information from the VM (which we're doing already in the
> amgetbatch patch, and which Matthias' patch does too), and delay
> dropping a batch's leaf page pin until after its VM cache is loaded,
> it reliably avoids races of the kind that we need to be worried about
> here.
Correct: my patch does the VM lookups currently done by IndexOnlyNext
just before releasing the index page, and eventually passes that VM
lookup result on to IndexOnlyNext, allowing the scan to drop the pin
on the index page without opening itself up to the mentioned race
condition with vacuum.
One major difference between my patch and this patch is that in my
patch, the index AM itself asks the table AM to do visibility checks
on a set of TIDs, at the point where the index AM would like to drop
the pin.
With your v8-0003, the index AM doesn't have access to such an API
(instead, heap_batch_getnext does the VM checks), which makes the
batch API harder to implement for index AMs that can't return tuples
one page at a time (like GiST with ordered scans). If
heap_batch_resolve_visibility (or my patch's
table_index_vischeck_tuples() API, see v14-0003) were exposed for
index AMs to call, and if the batch API allowed the index AM to drop
the pin instead of requiring the table AM to do that, that would be a
huge enabler for an implementation of amgetbatch() for GiST/SP-GiST.
[...]
> The race in question involves VACUUM concurrently setting a VM page
> all visible on a heap page with a TID that is also recycled by VACUUM
> (as it sets its page all-visible). We can safely allow VACUUM to go
> ahead with this while still dropping our pin early -- provided we build our
> local cache of visibility information first. Holding on to a leaf page
> pin while reading from the VM suffices. The important principle is
> that our local cache of VM info is (and will remain) consistent with
> what we saw on the index page when we saved its matching TIDs into a
> batch. (It doesn't matter that we do heap fetches for now-all-visible
> pages, because they cannot possibly be visible to the scan's MVCC
> snapshot. Just like in the plain index scan dropPin case. And rather
> like bitmap index scans.)
Correct. One thing to note here is that this doesn't mean you can take
the VM results obtained while reading one index page and reuse them
for the tuples of another index page; that would likely cause issues
if it isn't tracked correctly.
> v8-0003-Add-batching-interfaces-used-by-heapam-and-nbtree.patch has a
> new isolation test that demonstrates the new "drop pins eagerly during
> index-only scans" behavior, which is named
> index-only-scan-visibility.spec. The isolation test is a variant of
> the one I posted on the GiST thread, which proved that GiST is broken
> here (a problem that Matthias is working on fixing). If you attempt to run this
> isolation test on master, it'll block forever; VACUUM can never acquire a
> cleanup lock due to a conflicting buffer pin held by an index-only scan.
> That doesn't happen with v8 of the patch, though; it completes in less than
> 20ms on my system (and the scan actually gives correct results!).
Btree scans without lock coupling have been in the works for a while
now; this change would add some final icing to the cake.
Do check out the GiST/SP-GiST bugfix patchset at [0], which includes
more infrastructure that reduces VM accesses and VM buffer churn to a
minimum.
> This still leaves non-MVCC snapshot scans. There's nothing we can do
> to avoid holding on to a leaf page buffer pin while accessing the heap
> there. But that's okay; now we just refuse to do I/O prefetching
> during such scans.
>
> Dropping leaf page buffer pins during scans of an unlogged relation
> -------------------------------------------------------------------
>
> Another thing that hinders nbtree's dropPin optimization (and that we
> must deal with to get a guarantee that leaf page buffer pins never
> really need to be kept around) is the use of an unlogged relation.
> That breaks dropPin's approach to detecting unsafe concurrent TID
> recycling when marking dead items LP_DEAD on index pages, since that
> involves stashing a page LSN, and then checking if it has changed
> later on.
>
> We solve that problem by introducing GiST style "fake LSNs" to both
> ntbree and hash. Now the same LSN trick works for unlogged relations,
> too.
It looks like this 'fake LSN' in nbtree is updated every time the page
is modified (including on every insert), rather than only when tuples
are removed.
Strictly speaking, only the latter case needs (fake) LSN updates in
unlogged relations where we're marking tuples dead: an insert doesn't
remove a TID and re-insert it with different values. Bumping the fake
LSN only on removals would be a fairly cheap micro-optimization,
useful for indexes with high page-level churn.
Context: GiST uses LSNs for more than just detecting if it's safe to
mark its tuples dead: it also uses LSNs in its descent algorithm (see
"findPath" in src/backend/access/gist/README). As such, it has to
update the page LSN more frequently than nbtree, and I think there are
similar considerations in SP-GiST.
However, nbtree (and presumably hash too) only uses page LSNs to check
whether it can safely set dead bits, and for that to work correctly we
only need to invalidate the LSN associated with the cached TID lists
when a TID is removed from the page, whether through vacuum, page
splits, or other tuple movement.
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)