TID recycling race during nbtree index-only scans that run on a standby

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Álvaro Herrera <alvherre(at)kurilemu(dot)de>
Subject: TID recycling race during nbtree index-only scans that run on a standby
Date: 2026-06-17 22:16:43
Message-ID: CAH2-Wzkro678xEDOrsxrq-YDVP4w7nXybhVJV2MxrzTj-5OU4w@mail.gmail.com
Lists: pgsql-hackers

Recently, I've been thinking a lot about the interlocking protocol
that prevents wrong answers during index-only scans, even with
concurrent TID recycling (since it is relevant to the index
prefetching work). I'm referring to the way index-only scans generally
hold a buffer pin on their scan's current leaf page position, which
will conflict with the cleanup locks that index vacuuming acquires on
every leaf page.

There are historical (and current) bugs that share the same basic
shape. There are live bugs in GiST and SP-GiST index-only scans [1]. A
similar bug also affected bitmap scans, which previously used
information from the visibility map as an optimization (before we
fixed that bug) [2].

Unfortunately, there appears to be yet another bug of that general nature.

The bug in question affects nbtree index-only scans running during hot
standby; these scans can see "phantom" resurrected rows when VACUUM
recycles stub LP_DEAD line pointers in heap pages sooner than is safe.
The LP_DEAD stubs are needed as tombstones, but VACUUM can sometimes
win the race and mark them LP_UNUSED prematurely -- also setting the
relevant heap page all-visible in the VM. In other words, this bug's
general symptoms match those of the other bugs I mentioned.

This is only possible because the standby won't acquire cleanup locks
on *every* index page -- unlike during original execution. It will
only cleanup lock whatever index pages actually had one or more index
tuples removed during VACUUM, which isn't quite good enough. In other
words, the rationale for removing the "pin scan" logic in commits
f65b94f6, 3e4b7d87, and 687f2cd7 was subtly flawed in that it didn't
consider index-only scans, which are legitimately a special case.
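
To make the difference in locking concrete, here is a toy Python model (not PostgreSQL code). It assumes one plausible reading of the page split variant: the scan read a TID from a leaf page that it still pins, and a later split moved the matching index tuple to a sibling page. `LeafPage`, `cleanup_locked_pages`, and `vacuum_waits_for_scan` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class LeafPage:
    """Toy stand-in for an nbtree leaf page (hypothetical, not a PostgreSQL struct)."""
    tids: set = field(default_factory=set)  # heap TIDs the page's index tuples point to
    pinned: bool = False                    # True while a scan holds a buffer pin


def cleanup_locked_pages(leaf_pages, dead_tids, lock_every_page):
    """Return the indexes of the leaf pages that index vacuuming cleanup locks.

    lock_every_page=True models original execution on the primary.
    lock_every_page=False models replay on the standby, which only locks
    pages that actually had index tuples removed.
    """
    return [
        i for i, page in enumerate(leaf_pages)
        if lock_every_page or bool(page.tids & dead_tids)
    ]


def vacuum_waits_for_scan(leaf_pages, dead_tids, lock_every_page):
    """A cleanup lock cannot be granted while another backend holds a pin."""
    locked = cleanup_locked_pages(leaf_pages, dead_tids, lock_every_page)
    return any(leaf_pages[i].pinned for i in locked)


def scenario():
    """Build the assumed page split scenario."""
    dead_tid = (1, 1)  # (heap block, offset) of an LP_DEAD stub
    # Page 0: pinned by the scan, which cached dead_tid before the split.
    # Page 1: now holds the index tuple that points to dead_tid.
    pages = [LeafPage(tids=set(), pinned=True), LeafPage(tids={dead_tid})]
    return pages, {dead_tid}


pages, dead = scenario()
print("primary waits:", vacuum_waits_for_scan(pages, dead, lock_every_page=True))
print("standby waits:", vacuum_waits_for_scan(pages, dead, lock_every_page=False))
```

In this model the primary's VACUUM is held back by the scan's pin, while replay on the standby is not. Replay can then go on to mark the stub LP_UNUSED and set the heap page all-visible while the scan still holds the stale TID.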

Attached are 2 patches, both intended to show the general nature of the problem.

The first patch is a repro written by Claude Code at my direction;
there are many tedious and fiddly details involved that aren't worth
discussing now. Multiple test cases show wrong answers, allowing the
bug to manifest in several different ways (delete+commit vs
insert+abort, page split vs index deletion).

The second patch resurrects the old "pin scan" logic into modern
nbtree, making the failing tests pass, and confirming my understanding
of the problem.

Neither patch is committable. The pin scan mechanism performed
terribly, and I cannot countenance actually bringing it back now. I
haven't yet given much thought to how we can fix this bug without
causing more harm than good. The rationale for removing the "pin scan"
logic was *almost* correct back in 2016; we simply failed to consider
how index-only scans are a special case (which, crucially, wasn't
documented anywhere in 2016, and still isn't today).

[1] https://postgr.es/m/CAH2-Wz=jjiNL9FCh8C1L-GUH15f4WFTWub2x+_NucngcDDcHKw@mail.gmail.com
[2] Fixed by April 2025 commit 459e7bf8
--
Peter Geoghegan

Attachment                                                        Content-Type               Size
v1-0001-Add-a-hot-standby-index-only-scan-TID-recycling-r.patch   application/octet-stream   22.8 KB
v1-0002-nbtree-resurrect-the-recovery-side-pin-scan-for-V.patch   application/octet-stream   9.1 KB