Re: [PATCH] Fix PITR pause bypass when initial XLOG_RUNNING_XACTS has subxid overflow

From: Jan Nidzwetzki <jnidzwetzki(at)gmx(dot)de>
To: Matt Blewitt <mble(at)planetscale(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Date: 2026-06-12 09:59:04
Message-ID: FA11936A-AC9D-4840-840F-99335EED16E8@gmx.de
Lists: pgsql-hackers

Hi Hackers,

This is a follow-up to Matt Blewitt's report and patch from February [1], which identified the following bug: when the first XLOG_RUNNING_XACTS record a standby replays has subxid_overflow set, standbyState gets stuck at STANDBY_SNAPSHOT_PENDING, and hot standby is never activated. As a consequence, recovery_target_action = 'pause' is silently ignored: recoveryPausesHere() returns immediately because !LocalHotStandbyActive, the PAUSE case falls through, and the server promotes instead of pausing.
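To make the control flow concrete, here is a minimal toy model of the bypass. It is not the PostgreSQL source: every toy_/Toy name is hypothetical, and it assumes that no later XLOG_RUNNING_XACTS record resolves the pending snapshot before the recovery target is reached.

```c
#include <stdbool.h>

/* Toy stand-ins; they mirror names used in this mail, not the real definitions. */
typedef enum
{
	TOY_ACTION_PAUSE,
	TOY_ACTION_PROMOTE,
	TOY_ACTION_SHUTDOWN
} ToyTargetAction;

typedef enum
{
	TOY_OUTCOME_PAUSED,
	TOY_OUTCOME_PROMOTED,
	TOY_OUTCOME_SHUTDOWN
} ToyOutcome;

/*
 * Assumption: hot standby is only activated from a non-overflowed initial
 * snapshot, and nothing else resolves the pending state before the target.
 */
static bool
toy_hot_standby_activated(bool first_running_xacts_overflowed)
{
	return !first_running_xacts_overflowed;
}

/* Returns true if the startup process would actually wait in the pause loop. */
static bool
toy_recovery_pauses_here(bool hot_standby_active)
{
	/* Nobody could connect, so the pause is skipped. */
	if (!hot_standby_active)
		return false;
	return true;
}

static ToyOutcome
toy_reach_recovery_target(ToyTargetAction action, bool hot_standby_active)
{
	switch (action)
	{
		case TOY_ACTION_SHUTDOWN:
			return TOY_OUTCOME_SHUTDOWN;
		case TOY_ACTION_PAUSE:
			if (toy_recovery_pauses_here(hot_standby_active))
				return TOY_OUTCOME_PAUSED;
			/* fall through */
		case TOY_ACTION_PROMOTE:
			break;
	}
	return TOY_OUTCOME_PROMOTED;
}
```

In this model, toy_reach_recovery_target(TOY_ACTION_PAUSE, toy_hot_standby_activated(true)) yields TOY_OUTCOME_PROMOTED, which is the reported symptom.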

We'd like to propose an alternative fix for the same problem and describe why we believe serving read-only queries in this state is safe, and why we deliberately do not advance standbyState to STANDBY_SNAPSHOT_READY as the earlier patch did.

Patches
=======
0001 - Behavior-preserving refactor: pull the connection-enabling block
out of CheckRecoveryConsistency() into a small helper, EnableHotStandbyConnections().

0002 - The fix: call EnableHotStandbyConnections() from the
RECOVERY_TARGET_ACTION_PAUSE path, just before recoveryPausesHere(), and add a TAP test.
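The ordering in 0002 can be sketched with a toy model. This is not the patch itself: the struct, the toy_ functions, and the with_fix switch are hypothetical, and the real helper may carry additional checks that are not modelled here.

```c
#include <stdbool.h>

typedef enum
{
	TOY_SNAPSHOT_PENDING,
	TOY_SNAPSHOT_READY
} ToyStandbyState;

typedef struct
{
	ToyStandbyState standby_state;	/* what recovery knows about running xacts */
	bool		hot_standby_active; /* are read-only connections allowed? */
	bool		paused;				/* did the startup process pause? */
} ToyRecovery;

/* Hypothetical stand-in for the helper extracted in 0001. */
static void
toy_enable_hot_standby_connections(ToyRecovery *r)
{
	/* Only the connection flag changes; standby_state is left alone. */
	r->hot_standby_active = true;
}

static void
toy_recovery_pauses_here(ToyRecovery *r)
{
	/* Without connections the pause is skipped. */
	if (!r->hot_standby_active)
		return;
	r->paused = true;
}

/* Pause path at the recovery target, with and without the proposed call. */
static void
toy_pause_at_target(ToyRecovery *r, bool with_fix)
{
	if (with_fix)
		toy_enable_hot_standby_connections(r);
	toy_recovery_pauses_here(r);
}
```

Starting from { TOY_SNAPSHOT_PENDING, false, false }, the call with with_fix = true ends with paused set and standby_state still TOY_SNAPSHOT_PENDING, while with_fix = false leaves paused unset.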

Why we believe enabling reads is correct
=======

The reason a standby normally refuses queries from an overflowed snapshot is the risk of an incorrect visibility decision for a subtransaction whose top-level transaction is still running on the primary.

When the initial RUNNING_XACTS is overflowed, KnownAssignedXids may be missing some subxids. For such a recovery snapshot, XidInMVCCSnapshot() first maps the xid to its topmost parent via SubTransGetTopmostTransaction() and then looks it up in the in-progress set. If that mapping is not available in pg_subtrans and the xid is not present in KnownAssignedXids, the xid is treated as "not in the snapshot", and the final committed/aborted decision is delegated to TransactionIdDidCommit() in HeapTupleSatisfiesMVCC().

During active WAL replay, this is the dangerous case: a subxid S of a still-running top transaction T may have its row present on disk; if S cannot be resolved as in-progress and T's commit record is later replayed, CLOG flips T to committed, and a query could suddenly see a row that was not committed as of its snapshot's xmax. That is exactly why connections are withheld until a non-overflowed snapshot (STANDBY_SNAPSHOT_READY) gives complete knowledge.
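A toy model of that hazard, as a sketch only: the arrays stand in for KnownAssignedXids, pg_subtrans and CLOG, all toy_/Toy names are hypothetical, and xids are assumed to be smaller than TOY_MAX_XID. It follows the lookup order described above, not the actual PostgreSQL code.

```c
#include <stdbool.h>

#define TOY_MAX_XID 16

typedef unsigned int ToyXid;

/* A snapshot taken by a query; it does not change afterwards. */
typedef struct
{
	bool		in_progress[TOY_MAX_XID];	/* stand-in for KnownAssignedXids */
	ToyXid		top_parent[TOY_MAX_XID];	/* stand-in for pg_subtrans, 0 = unknown */
} ToySnapshot;

/* Commit status, updated by WAL replay. */
typedef struct
{
	bool		committed[TOY_MAX_XID];
} ToyClog;

static bool
toy_xid_in_snapshot(const ToySnapshot *snap, ToyXid xid)
{
	/* Map to the topmost parent if the mapping exists, else use the xid. */
	ToyXid		top = snap->top_parent[xid] != 0 ? snap->top_parent[xid] : xid;

	return snap->in_progress[top];
}

static bool
toy_tuple_visible(const ToySnapshot *snap, const ToyClog *clog, ToyXid xmin)
{
	if (toy_xid_in_snapshot(snap, xmin))
		return false;			/* still running as of the snapshot */
	return clog->committed[xmin];	/* fallback: ask the commit log */
}

/* Replaying T's commit record marks T and its subxids committed. */
static void
toy_replay_commit(ToyClog *clog, ToyXid top, const ToyXid *subxids, int nsubxids)
{
	clog->committed[top] = true;
	for (int i = 0; i < nsubxids; i++)
		clog->committed[subxids[i]] = true;
}
```

With T = 10 in the snapshot and no parent mapping for S = 11, a row written by 11 is invisible before toy_replay_commit() and becomes visible to the same snapshot afterwards. If top_parent[11] is 10, the row stays invisible in both cases.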

At an end-of-recovery pause, this hazard disappears because replay is frozen. The only ways out of recoveryPausesHere(true) are promotion and shutdown. A pg_wal_replay_resume() at the end of recovery falls through to promotion rather than resuming replay.

Therefore, no commit record for any in-progress transaction will ever be replayed, so CLOG cannot transition T (or its subxids) to committed after this point. TransactionIdDidCommit() for such an xid stays false forever. So, the MVCC visibility fallback keeps the row invisible.

In short, the set of transactions a query can observe as committed is now stable, and is exactly the set that committed before replay stopped. The single condition that makes overflowed-snapshot reads unsafe during live replay (a transaction that was in progress as of a snapshot later being observed as committed) cannot arise once replay halts. So the pending snapshot, while still overflowed, yields correct and stable visibility.
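The stability argument can be illustrated with a small sketch. It assumes a toy WAL that consists only of commit records and a replay loop that can never restart once it has stopped; ToyReplay, ToyCommitRecord and toy_replay_until are hypothetical names, not PostgreSQL structures.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_XID 16

/* Toy WAL: every record is a commit record for one xid (< TOY_MAX_XID). */
typedef struct
{
	unsigned int xid;
} ToyCommitRecord;

typedef struct
{
	bool		committed[TOY_MAX_XID]; /* stand-in for CLOG */
	size_t		next_record;	/* index of the next record to replay */
	bool		replay_frozen;	/* set once the end-of-recovery pause is hit */
} ToyReplay;

/*
 * Replay records up to (not including) stop_at, then freeze. Once frozen,
 * further calls apply nothing: in this model there is no way to resume.
 */
static void
toy_replay_until(ToyReplay *r, const ToyCommitRecord *wal,
				 size_t nrecords, size_t stop_at)
{
	while (!r->replay_frozen &&
		   r->next_record < nrecords &&
		   r->next_record < stop_at)
	{
		r->committed[wal[r->next_record].xid] = true;
		r->next_record++;
	}
	r->replay_frozen = true;
}
```

For a WAL holding commits of xids 3, 5 and 10 with stop_at = 2, xids 3 and 5 are committed and 10 is not. A later call with a larger stop_at changes nothing, so any query sees the same committed set no matter when it runs during the pause.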

Why we keep standbyState at STANDBY_SNAPSHOT_PENDING
=======

The earlier patch forced standbyState to STANDBY_SNAPSHOT_READY at the pause point and re-ran CheckRecoveryConsistency(). We chose not to do that.

STANDBY_SNAPSHOT_READY means that we have full knowledge of the transactions that were running on the primary, so snapshots are complete and need not be treated as overflowed. That is not true here: the snapshot is still overflowed, and visibility stays correct because replay is frozen, not because the snapshot has become complete. Forcing the state to READY would assert something false about the recovery state.

Original report and first patch by Matt Blewitt. Thanks also for the analysis in that thread.

Thoughts welcome.

[1] https://www.postgresql.org/message-id/CACy-Nv24ZORVN9_S_yHF5Nsip45HKCBtKVNC3XdKgz%2B1wvGvEQ%40mail.gmail.com

Best Regards
Jan Nidzwetzki
On behalf of PlanetScale

Attachment Content-Type Size
0001-Refactor-extract-EnableHotStandbyConnections-helper.patch application/octet-stream 3.4 KB
0002-Honor-recovery_target_action-pause-on-inconsistent-s.patch application/octet-stream 9.7 KB
