| From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
|---|---|
| To: | cca5507(at)qq(dot)com |
| Cc: | suryapoondla4(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: [BUG] Take a long time to reach consistent after pg_rewind |
| Date: | 2026-06-11 06:35:32 |
| Message-ID: | 20260611.153532.119751043006159033.horikyota.ntt@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hello.
At Thu, 11 Jun 2026 11:03:17 +0800, "cca5507" <cca5507(at)qq(dot)com> wrote in
> I get a new idea to fix this bug from:
>
> https://www.postgresql.org/message-id/05DEB465-3C99-417E-B7FA-275A28068D90%40yandex-team.ru
>
> Please see the v2 patch for details.
Thanks for the patch.
However, I am not sure that using the flush LSN is the right
direction.
As I understand it, the root of the problem is that
get_current_wal_insert_lsn() returns the next WAL insertion position,
while minRecoveryPoint is expected to point just past the end of the
last required record. At positions immediately following a WAL page or
segment header, those locations can be logically equivalent but
numerically different.
I think pg_rewind is probably using the insert LSN because it wants to
choose a conservative position as far ahead as possible. It might be
possible to use the flush LSN if the copying logic is carefully
arranged, but I would prefer to keep using the insert LSN if we can.
I also think it is unfortunate to require client-side tools to be
aware of the exact "end of last record + 1" representation. If both
representations are logically equivalent at those positions, I would
rather have the recovery side accept and normalize them than require
every tool to produce exactly the expected form.
So I wonder whether we should instead normalize minRecoveryPoint when
reading it from the control file, so that these equivalent
representations are treated as the same position. That would also make
the recovery side a bit more robust.
For example, something like this:
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 73b78a83fa7..660c681795d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -955,6 +955,20 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
{
minRecoveryPoint = ControlFile->minRecoveryPoint;
minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+
+ /*
+ * minRecoveryPoint in the control file is expected to point to the
+ * location immediately following the end of the target
+ * record. However, some tools may record a location immediately after
+ * a WAL segment or page header when the target position is at the
+ * beginning of a segment or page. Normalize such values here before
+ * further processing.
+ */
+ if (XLogSegmentOffset(minRecoveryPoint, wal_segment_size)
+ == SizeOfXLogLongPHD)
+ minRecoveryPoint -= SizeOfXLogLongPHD;
+ else if (minRecoveryPoint % XLOG_BLCKSZ == SizeOfXLogShortPHD)
+ minRecoveryPoint -= SizeOfXLogShortPHD;
}
else
{
Thoughts?
--
Kyotaro Horiguchi
NTT Open Source Software Center
| From | Date | Subject | |
|---|---|---|---|
| Next Message | shveta malik | 2026-06-11 06:38:13 | Re: [PATCH] Preserve replication origin OIDs in pg_upgrade |
| Previous Message | solai v | 2026-06-11 06:35:01 | Re: Add pg_stat_kind_info system view |