| From: | surya poondla <suryapoondla4(at)gmail(dot)com> |
|---|---|
| To: | cca5507 <cca5507(at)qq(dot)com> |
| Cc: | pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: [BUG] Take a long time to reach consistent after pg_rewind |
| Date: | 2026-05-19 21:20:56 |
| Message-ID: | CAOVWO5p7CNqx-j+M37V9ncbVZ-P5TacNekX9FbmqUG1Ene59cA@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi ChangAo,
Thanks for the careful diagnosis. I reproduced the hang on macOS with the
latest PostgreSQL code (it took a lot of iterations to hit it).
The LSN trace matches your description. This is what I saw:
minRecoveryPoint = 0/08000028
consistent recovery state reached at = 0/08000060
In my run the standby was stuck for ~9 s; consistency was eventually
declared at 0/08000060 because a small upstream record (most likely
a RUNNING_XACTS snapshot from bgwriter) landed at 0/08000028, ended at
0/08000060, and let lastReplayedEndRecPtr move past the bad
minRecoveryPoint.
With the new primary stopped after pg_rewind, the wait was unbounded as
expected.
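For anyone checking the arithmetic, the sketch below (plain Python, not PostgreSQL code) shows why 0/08000028 is suspicious: it sits exactly one long page header past a segment boundary. The constants are assumptions for a default build (8 kB WAL pages, 16 MB segments, 24-byte short and 40-byte long page headers), and the helper names are made up for illustration.

```python
# Assumed default-build values; none of these numbers come from the thread.
XLOG_BLCKSZ = 8192                    # WAL page size
WAL_SEGMENT_SIZE = 16 * 1024 * 1024   # default wal_segment_size
SIZE_OF_XLOG_SHORT_PHD = 24           # SizeOfXLogShortPHD (assumed)
SIZE_OF_XLOG_LONG_PHD = 40            # SizeOfXLogLongPHD (assumed)


def parse_lsn(text):
    """Convert an LSN printed as 'hi/lo' (hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def page_header_size(lsn):
    """Header size of the WAL page containing lsn.

    The first page of each segment carries the long header,
    every other page the short one.
    """
    page_start = lsn - lsn % XLOG_BLCKSZ
    if page_start % WAL_SEGMENT_SIZE == 0:
        return SIZE_OF_XLOG_LONG_PHD
    return SIZE_OF_XLOG_SHORT_PHD


def is_just_past_page_header(lsn):
    """True if lsn points at the first byte after a page header."""
    return lsn % XLOG_BLCKSZ == page_header_size(lsn)


print(is_just_past_page_header(parse_lsn("0/08000028")))
print(is_just_past_page_header(parse_lsn("0/08000060")))
```

Under those assumptions, 0/08000028 is flagged as a position just past a page header, while 0/08000060 is not.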
Regarding the fix: the underlying issue is that minRecoveryPoint is
implicitly expected to be the end-LSN of a real WAL record, because
lastReplayedEndRecPtr (the value it gets compared against)
can only ever take such values. All current writers respect this
expectation except pg_rewind: pg_basebackup uses the backup-end record's
EndRecPtr, and the UpdateMinRecoveryPoint path during recovery
uses buffer LSNs, both of which are record-end LSNs by construction.
pg_rewind alone uses pg_current_wal_insert_lsn(), which can return a
position just past a page header when the source is idle.
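To make the comparison concrete, here is a toy model in Python of the check as I understand it: consistency is declared at the first replayed record whose end LSN is at or past minRecoveryPoint. The function name is hypothetical, and the example assumes, purely for illustration, that the last record replayed before the idle period ended at the segment boundary 0/08000000.

```python
def first_consistent_lsn(min_recovery_point, replayed_record_ends):
    """Return the first replayed record-end LSN that satisfies
    end_lsn >= min_recovery_point, or None if replay never gets there.

    replayed_record_ends stands in for the successive values of
    lastReplayedEndRecPtr; it only ever contains record-end LSNs.
    """
    for end_lsn in replayed_record_ends:
        if end_lsn >= min_recovery_point:
            return end_lsn
    return None


MIN_RECOVERY_POINT = 0x08000028   # value written by pg_rewind in my run

# Idle source: replay stops at the segment boundary (assumed), so the
# target is never reached and the standby waits.
idle = first_consistent_lsn(MIN_RECOVERY_POINT, [0x08000000])

# One more upstream record ending at 0/08000060 unblocks it.
unblocked = first_consistent_lsn(MIN_RECOVERY_POINT, [0x08000000, 0x08000060])

print(idle, hex(unblocked))
```

The point of the model is that a minRecoveryPoint which is not itself a record-end LSN can only be satisfied by a later record, which an idle source never produces.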
That's why I'd lean toward fixing the producer (pg_rewind).
Concretely, your original suggestion of having pg_rewind use
GetXLogInsertEndRecPtr() instead of GetXLogInsertRecPtr() restores
the invariant globally and doesn't require future call sites that compare
against minRecoveryPoint to know about page-header adjustments.
If we still want a defense-in-depth guard in CheckRecoveryConsistency() to
handle older pg_rewind binaries running against a newer server,
the v1 patch is on the right track, but I'd suggest:
- documenting in the helper comment why exactly SizeOfXLogShortPHD /
SizeOfXLogLongPHD past a page boundary are the only legal
"non-record-end" minRecoveryPoint values (i.e. who can produce
them and under what conditions);
- auditing the other call sites that compare against
minRecoveryPoint to confirm none of them needs the same
adjustment, with a comment recording the conclusion.
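As a rough illustration of what such a guard would have to do (a Python sketch of the idea, not the v1 patch), the helper below maps a minRecoveryPoint that sits exactly one page header past a page boundary back to that boundary and leaves every other value alone. The helper name is hypothetical, and the constants are assumed default-build values (8 kB pages, 16 MB segments, 24/40-byte headers).

```python
# Assumed default-build values; not taken from the patch or the thread.
XLOG_BLCKSZ = 8192
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SIZE_OF_XLOG_SHORT_PHD = 24
SIZE_OF_XLOG_LONG_PHD = 40


def effective_min_recovery_point(lsn):
    """Normalize an insert-position style LSN to a record-end style LSN.

    If lsn is exactly the first byte after a page header, the previous
    record ended at the page boundary, so compare against that instead.
    """
    offset = lsn % XLOG_BLCKSZ
    page_start = lsn - offset
    # First page of a segment has the long header, others the short one.
    if page_start % WAL_SEGMENT_SIZE == 0:
        header = SIZE_OF_XLOG_LONG_PHD
    else:
        header = SIZE_OF_XLOG_SHORT_PHD
    return page_start if offset == header else lsn


print(hex(effective_min_recovery_point(0x08000028)))
```

With the value from my run, 0/08000028 would be compared as 0/08000000, which an idle standby can actually reach. This is also why the comment on the real helper matters: the adjustment is only safe if those two offsets are the only non-record-end values a producer can write.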
I can put together a TAP test under src/bin/pg_rewind/t/ that forces a WAL
switch on the source, runs pg_rewind against an
otherwise-idle primary, and asserts that the rewound node reaches
consistency without further upstream activity.
Happy to send a v2 with that test if useful.
This is a liveness bug with potentially unbounded wait on idle promoted
primaries, so it does seem worth back-patching.
Regards,
Surya Poondla