Quick Links

Re: [PATCH] Fix pg_rewind false positives caused by shutdown-only WAL

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Srinath Reddy Sadipiralla <srinath2133(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: [PATCH] Fix pg_rewind false positives caused by shutdown-only WAL
Date:	2025-09-25 19:19:55
Message-ID:	CA+TgmoYDgcV8viFQNLHYyC2Zct4sRsOf99rWWzjt6k3Z2hpNig@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sat, Sep 6, 2025 at 12:34 PM Srinath Reddy Sadipiralla
<srinath2133(at)gmail(dot)com> wrote:
> While working with pg_rewind, I noticed that it can sometimes request a rewind even when no actual changes exist after a failover.
>
> Problem:
> Currently, pg_rewind determines the end-of-WAL on the target by using the last shutdown checkpoint (or minRecoveryPoint for a standby). This creates a false positive scenario:
>
> 1)Suppose a standby is promoted to become the new primary.
> 2)Later, the old primary is cleanly shut down.
> 3)The only WAL record generated on the old primary after divergence is a shutdown checkpoint.
>
> At this point, the old primary and new primary contain identical data.

This isn't true, because the control file changes when the shutdown
checkpoint is written.

Also, I think if you tested this even once, you'd find out that it
doesn't actually work. At least, it doesn't work for me, and I don't
see how it would work for you. I created a primary. Then I made a
standby. Then I promoted the standby while doing a clean shutdown of
the primary. Then I tried to restart the primary as a standby, and I
get an infinite loop like this:

2025-09-25 15:07:03.094 EDT [58031] FATAL: terminating walreceiver
process due to administrator command
2025-09-25 15:07:03.094 EDT [56485] LOG: new timeline 2 forked off
current database system timeline 1 before current recovery point
0/040000D8

Now, admittedly, I didn't run pg_rewind here. But if I understand
correctly, your patch's idea is to have pg_rewind say that everything
is fine in this scenario. What it won't do is make postgres itself
agree with that conclusion. I'm actually not sure the patch would have
achieved that goal here, because the WAL actually looks like this:

rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn:
0/04000028, prev 0/03000120, desc: RUNNING_XACTS nextXid 754
latestCompletedXid 753 oldestRunningXid 754
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn:
0/04000060, prev 0/04000028, desc: CHECKPOINT_SHUTDOWN redo
0/04000060; tli 1; prev tli 1; fpw true; wal_level replica; xid 0:754;
oid 14029; multi 1; offset 0; oldest xid 746 in DB 1; oldest multi 1
in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid
0; shutdown

Your patch doesn't seem to contain any logic to ignore RUNNING_XACTS,
just CHECKPOINT_SHUTDOWN. So in this scenario it might have still
thought a rewind was necessary. But I think this test case still makes
the point that you can't just change pg_rewind if the server has a
different idea about how things work.

Also, the server is correct to be concerned about the control file
changing, and your patch is wrong to try to ignore such a change. It's
true that the control file is not a relation data file, but its
contents are very important, and updates to them matter when deciding
whether servers are in sync.

Also, even if the idea of your patch were correct, the logic it uses
to try to find an XLOG_CHECKPOINT_SHUTDOWN or XLOG_SWITCH record is
wrong. It's mandatory to first test XLogRecGetRmid(xlogreader) against
the appropriate RM_whatever_ID; see walsummarizer.c for an example of
how to do this correctly.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

[PATCH] Fix pg_rewind false positives caused by shutdown-only WAL at 2025-09-06 16:33:45 from Srinath Reddy Sadipiralla

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2025-09-25 19:36:18	Re: Avoiding roundoff error in pg_sleep()
Previous Message	Masahiko Sawada	2025-09-25 19:15:27	Re: POC: enable logical decoding when wal_level = 'replica' without a server restart