Re: [BUG] Take a long time to reach consistent after pg_rewind

From: surya poondla <suryapoondla4(at)gmail(dot)com>
To: cca5507 <cca5507(at)qq(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [BUG] Take a long time to reach consistent after pg_rewind
Date: 2026-05-19 21:20:56
Message-ID: CAOVWO5p7CNqx-j+M37V9ncbVZ-P5TacNekX9FbmqUG1Ene59cA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Subject: Re: [BUG] Take a long time to reach consistent after pg_rewind

Hi ChangAo,

Thanks for the careful diagnosis, I reproduced the hang on macOS on the
latest postgres code (It took a lot of iterations to reproduce it)
The LSN trace matches your description and I saw the below:
minRecoveryPoint = 0/08000028
consistent recovery state reached at = 0/08000060

In my run the standby was stuck for ~9 s; consistency was eventually
declared at 0/08000060 because a small upstream record (most likely
a RUNNING_XACTS snapshot from bgwriter) landed at 0/08000028 and let
lastReplayedEndRecPtr leap past the bad finish line.
With the new primary stopped after pg_rewind, the wait was unbounded as
expected.

Regarding the fix: the underlying issue is that minRecoveryPoint is
implicitly expected to be the end-LSN of a real WAL record, because
lastReplayedEndRecPtr (the value it gets compared against)
can only ever take such values. All current writers respect this
expectation except pg_rewind: pg_basebackup uses the backup-end record's
EndRecPtr, and the in-running UpdateMinRecoveryPoint path
uses buffer LSNs, both of which are record-end LSNs by construction.
pg_rewind alone uses pg_current_wal_insert_lsn(), which can return a
position just past a page header when the source is idle.
That's why I'd lean toward fixing the producer (pg_rewind).

Concretely, your original suggestion having pg_rewind use
GetXLogInsertEndRecPtr() instead of GetXLogInsertRecPtr(), restores
the invariant globally, and doesn't require future call sites that compare
against minRecoveryPoint to know about page-header adjustments.

If we still want a defense-in-depth guard in CheckRecoveryConsistency() to
handle older pg_rewind binaries running against a newer server,
the v1 patch is on the right track, but I'd suggest:
- documenting in the helper comment why exactly SizeOfXLogShortPHD /
SizeOfXLogLongPHD past a page boundary are the only legal
"non-record-end" minRecoveryPoint values (i.e. who can produce
them and under what conditions);

- auditing the other call sites that compare against
minRecoveryPoint to confirm none of them needs the same
adjustment, with a comment recording the conclusion.

I can put together a TAP test under src/bin/pg_rewind/t/ that forces a WAL
switch on the source, runs pg_rewind against an
otherwise-idle primary, and asserts that the rewound node reaches
consistency without further upstream activity.
Happy to send a v2 with that test if useful.

This is a liveness bug with potentially unbounded wait on idle promoted
primaries, so it does seem worth back-patching.

Regards,
Surya Poondla

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Chao Li 2026-05-19 23:36:32 Re: Fix SPLIT PARTITION bound-overlap bug and other improvements
Previous Message Daniel Gustafsson 2026-05-19 21:18:08 Re: PostgreSQL and OpenSSL 4.0.0