| From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
|---|---|
| To: | Peter Geoghegan <pg(at)bowt(dot)ie> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Álvaro Herrera <alvherre(at)kurilemu(dot)de> |
| Subject: | Re: TID recycling race during nbtree index-only scans that run on a standby |
| Date: | 2026-06-22 11:38:07 |
| Message-ID: | CAEze2WighSz7jbnw4L0KgQEBg_4+HJhDKdYwQTgeyDTXHwFvfQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Thu, 18 Jun 2026 at 00:17, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> Recently, I've been thinking a lot about the interlocking protocol
> that prevents wrong answers during index-only scans, even with
> concurrent TID recycling (since it is relevant to the index
> prefetching work). I'm referring to the way index-only scans generally
> hold a buffer pin on their scan's current leaf page position, which
> will conflict with the cleanup locks that index vacuuming acquires on
> every leaf page.
> The bug in question affects nbtree index-only scans running during hot
> standby; these scans can see "phantom" resurrected rows when VACUUM
> recycles stub LP_DEAD line pointers in heap pages sooner than is safe.
> The LP_DEAD stubs are needed as tombstones, but VACUUM can sometimes
> win the race and mark them LP_UNUSED prematurely -- also setting the
> relevant heap page all-visible in the VM.
All that, for queries active on that standby, right? This isn't an
issue about the primary, but rather one about WAL replay not currently
providing the right interlocks?
> In other words, this bug's
> general symptoms match those of the other bugs I mentioned.
>
> This is only possible because the standby won't acquire cleanup locks
> on *every* index page -- unlike during original execution. It will
> only cleanup lock whatever index pages actually had one or more index
> tuples removed during VACUUM, which isn't quite good enough. In other
> words, the rationale for removing the "pin scan" logic in commits
> f65b94f6, 3e4b7d87, and 687f2cd7 was subtly flawed in that it didn't
> consider index-only scans, which are legitimately a special case.
>
> Attached are 2 patches, both intended to show the general nature of the problem.
>
> The first patch is a repro written by Claude code at my direction;
> there are many tedious and fiddly details involved that aren't worth
> discussing now. Multiple test cases show wrong answers, allowing the
> bug to manifest in several different ways (delete+commit vs
> insert+abort, page split vs index deletion).
>
> The second patch resurrects the old "pin scan" logic into modern
> nbtree, making the failing tests pass, and confirming my understanding
> of the problem.
>
> Neither patch is committable. The pin scan mechanism performed
> terribly, and I cannot countenance actually bringing it back now. I
> haven't yet given much thought to how we can fix this bug without
> causing more harm than good. The rationale for removing the "pin scan"
> logic was *almost* correct back in 2016; we simply failed to consider
> how index-only scans are a special case (which, crucially, wasn't
> documented anywhere in 2016, and still isn't today).
I have a few thoughts:
There are two bugs here, both arriving from the same issue of removing
dead items from the page without pin interlock:
1. page splits move cached dead items away from the page, allowing
them to be cleaned up from another page without the original page
(which may still have pins) being involved, and
2. replay of XLOG_BTREE_DELETE doesn't take a cleanup lock (which
provides the "pin interlock") on the page.
For the second case, obviously, we should use a cleanup lock for
XLOG_BTREE_DELETE.
For the first case, this probably also would be handled if we took a
cleanup lock on the original page during replay - the operations is
removing tuples from that page, and we don't yet know if they're going
to be recycled soon (though that is unlikely given page split's
efforts to avoid keeping dead tuples on the page, it is possible). See
attached (on top of your v1-0001).
Alternatively, we could hold normal read locks on the page on
replica's IOS while the scan needs that page. That'd risk getting
really bad performance in the normal case, though, where page
splits/cleanup happen much more rarely than insertions; while those
inserts also require interlocked access with page reads.
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)
PS. This once again reminds me, that with all the new buffer pin
-holding systems in AIO we probably should have a 'hurry up; I need to
access this buffer and you're blocking my very important job' signal
system. Cursors (and other queries) that hold on to IOS' required
pins only because it decided VM checks should happen very lazily
should then be able to run those VM checks immediately on receiving
that signal, instead of having to wait for the query state to continue
on past the current page.
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Add-a-hot-standby-index-only-scan-TID-recycling-r.patch | application/octet-stream | 22.8 KB |
| v2-0002-nbtree-Use-cleanup-locks-in-redo-for-IOS-interloc.patch | application/octet-stream | 2.1 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Ashutosh Sharma | 2026-06-22 11:54:30 | Re: Report bytes and transactions actually sent downtream |
| Previous Message | Tatsuya Kawata | 2026-06-22 11:34:12 | Re: [PATCH] doc: clarify pg_stat_lock.fastpath_exceeded scope |