Re: BUG: Former primary node might stuck when started as a standby

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: 'Alexander Lakhin' <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Aleksander Alekseev <aleksander(at)timescale(dot)com>
Subject: Re: BUG: Former primary node might stuck when started as a standby
Date: 2026-03-04 05:31:29
Message-ID: aafDsb5snkfkNfdS@paquier.xyz
Lists: pgsql-hackers

On Tue, Mar 03, 2026 at 09:17:16AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Thanks for the info. So I can provide the patch after the issue for 009_twophase.pl
> is fixed. For better understanding we may be able to fork new
> thread.

Regarding your posted v4, I am actually not convinced that there is a
need for injection points and disabling standby snapshots, for the
three sequences of tests proposed.

While the first wait_for_replay_catchup() can be useful before the
teardown_node() of the primary in the "Check that prepared
transactions can be committed on promoted standby" sequence, it still
has a limited impact. It looks like we could have other parasite
records as well, depending on how slowly the primary is stopped? I
think that we should switch to a plain stop() of the primary, as the
test wants to check that prepared transactions can be committed on a
standby. Stopping the primary abruptly does not matter for this
sequence.

For the second wait_for_replay_catchup(), after the PREPARE of
xact_009_11, I may be missing something, but how does it change
things? A plain stop() of the primary means that the standby would
have received all the WAL records from the primary on disk in its
pg_wal/, no? Upon restart, it should replay everything it finds in
pg_wal/. I don't see a change required here.

For the third wait_for_replay_catchup(), after the PREPARE of
xact_009_12, same dance. The primary is cleanly stopped first. All
the WAL records of the primary should have been flushed to the
standby.

As a whole, it looks like we should just switch the teardown_node()
call to a stop() call in the first test with xact_009_10, backpatch
it, and call it a day. No need for injection points and no need for
GUC tweaks. I have not looked at 004_timeline_switch yet.
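
As a rough, untested sketch of the sequence discussed above: the node,
table and transaction names below are hypothetical and are not the ones
used in 009_twophase.pl, and the script assumes that it runs inside the
PostgreSQL source tree's TAP framework. It only illustrates a clean
stop() of the primary before promoting the standby, in place of
teardown_node(), which performs an immediate stop:

```perl
use strict;
use warnings FATAL => 'all';
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# Hypothetical node names, for illustration only.
my $primary = PostgreSQL::Test::Cluster->new('primary');
$primary->init(allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'max_prepared_transactions = 10');
$primary->start;
$primary->backup('bkp');

# The standby inherits postgresql.conf from the base backup.
my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup($primary, 'bkp', has_streaming => 1);
$standby->start;

# Prepare a transaction on the primary.
$primary->safe_psql(
    'postgres', "
    BEGIN;
    CREATE TABLE t_sketch (id int);
    INSERT INTO t_sketch VALUES (1);
    PREPARE TRANSACTION 'xact_sketch';");

# Clean shutdown (fast mode by default) rather than teardown_node():
# the walsender sends its remaining WAL before the primary exits.
$primary->stop;
$standby->promote;

# The prepared transaction should be committable on the promoted standby.
$standby->safe_psql('postgres', "COMMIT PREPARED 'xact_sketch'");
is($standby->safe_psql('postgres', 'SELECT count(*) FROM t_sketch'),
    '1', 'prepared transaction committed on promoted standby');

done_testing();
```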

> I guess so. cluster::stop does the `pg_ctl stop -m fast` command. In this case
> the walsender waits till there are nothing to be sent, see WalSndLoop().
> Do let me know if you have observed the similar failure here.

Exactly. Doing a clean stop of the primary offers a strong guarantee
here. We are sure that the standby will have received all the records
from the primary. Timeline forking cannot happen in
012_subtransactions.pl given how the switchover from the primary to
the standby happens. I don't see a need for tweaking this test at
all. Or perhaps you did see a failure of some kind in this test,
Alexander?
--
Michael
