Re: [BUG] Take a long time to reach consistent after pg_rewind

From: surya poondla <suryapoondla4(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: cca5507(at)qq(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [BUG] Take a long time to reach consistent after pg_rewind
Date: 2026-06-15 23:22:06
Message-ID: CAOVWO5rqRQieJsQdWhPw0j=-zfp=A4aokfjOyT2-3dQyO1-zSw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thanks ChangAo for v2, and Kyotaro for working through the alternatives. I
think v2 is on the right track.

For me the Flush LSN looks correct for the below reasons:
1. The standby can reach it. XLOG_SWITCH (driven by archive_timeout) calls
XLogFlush(EndPos) with EndPos = segment boundary. xlogreader treats SWITCH
specially and advances NextRecPtr to end-of-segment
(XLogDecodeNextRecord()), so lastReplayedEndRecPtr lands on the segment
boundary and the consistency check fires. With v2, minRecoveryPoint =
segment boundary, which is exactly reachable.

This restores the implicit producer-side invariant: pg_basebackup's
XLOG_BACKUP_END EndRecPtr, UpdateMinRecoveryPoint() via
GetCurrentReplayRecPtr(), and the standby-source path in pg_rewind all
yield a value record-driven progress can match.

Caveat: XLogBackgroundFlush rounds WriteRqst.Write down to a page boundary,
and the asyncXactLSN bump only re-targets when we've already flushed past
it. On a busy primary the flush LSN can therefore sit mid-record, but
that's fine because live insertions advance the standby past whatever
endrec we pin. Not the scenario in this bug, just worth noting that no
producer-side or recovery-side fix eliminates every theoretical liveness
corner.

2. GetXLogInsertRecPtr() reads CurrBytePos from shared memory not durable.
If the source crashes after pg_rewind, it recovers to its flush LSN,
leaving the target's minRecoveryPoint pinned to a higher
insert LSN which is unreachable forever. flush LSN sits at the segment
boundary, which is on disk and recovery cannot regress past it. So this
approach is crash-safe

3. perform_rewind() does: copy files -> fetch source control file ->
get_current_wal_flush_lsn.
Any WAL-logged page on source's disk has page-LSN <= source's flush-LSN at
the time it was written (FlushBuffer enforces XLogFlush before eviction;
hint-bit pages with checksums use XLogSaveBufferForHint + PageSetLSN).
Flush LSN is monotonic, so the value dominates every copied page-LSN.
FSM/VM pages are exempt by design (freespace.c uses MarkBufferDirtyHint
without WAL) but aren't gated by minRecoveryPoint. Unlogged tables are out
of scope. Torn pages are closed by the existing full_page_writes=on
precondition.

Review comments
1. Commit message understates the fix. It only describes the page-header
symptom. The crash-safety property is the stronger argument and the one
that resolves the back-and-forth on which LSN to use. Suggest adding:
a. "Using the flush LSN is also crash-safe with respect to the source: the
insert LSN lives only in shared memory and can be lost on a source crash,
leaving the standby's minRecoveryPoint ahead of any LSN the source can
subsequently reach."
2. Code comment should explain why flush LSN is sufficient. The current "We
must replay to the last WAL flush location" doesn't say why. Suggest:
a. "Use the source's flush LSN as the target's minRecoveryPoint: every
WAL-logged page we copied has page-LSN <= source's flush LSN at copy time
(WAL-before-data), and flush LSN is monotonic. We avoid the insert LSN
because it can sit one page-header past a record's end at segment
boundaries (where no record will end), and it is not durable, a source
crash can leave flush LSN behind an insert LSN we already pinned."
3. Worth a comment in rewind_source.h that the callback must only be
invoked against a non-standby source, pg_current_wal_flush_lsn() errors out
under recovery.
4. No regression test. We can add a regression test under
src/bin/pg_rewind/t/.

Otherwise +1 on the v2 direction.

Regards,
Surya Poondla

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2026-06-15 23:35:30 Re: BM_IO_ERROR flag is lost in TerminateBufferIO due to order of operations in UnlockBufHdrExt
Previous Message Bharath Rupireddy 2026-06-15 23:21:31 Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE