| From: | Michael Paquier <michael(at)paquier(dot)xyz> |
|---|---|
| To: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
| Cc: | 'Alexander Lakhin' <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Aleksander Alekseev <aleksander(at)timescale(dot)com> |
| Subject: | Re: BUG: Former primary node might stuck when started as a standby |
| Date: | 2026-03-04 05:31:29 |
| Message-ID: | aafDsb5snkfkNfdS@paquier.xyz |
| Lists: | pgsql-hackers |
On Tue, Mar 03, 2026 at 09:17:16AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Thanks for the info. So I can provide the patch after the issue for 009_twophase.pl
> is fixed. For better understanding we may be able to fork new
> thread.
Regarding your posted v4, I am actually not convinced that there is a
need for injection points and disabling standby snapshots, for the
three sequences of tests proposed.
While the first wait_for_replay_catchup() can be useful before the
teardown_node() of the primary in the "Check that prepared
transactions can be committed on promoted standby" sequence, it still
has a limited impact. It looks like we could have other parasite
records as well, depending on how slowly the primary is stopped? I
think that we should switch to a plain stop() of the primary, as the test
only wants to check that prepared transactions can be committed on a
standby. Stopping the primary abruptly does not matter for this
sequence.
For the second wait_for_replay_catchup(), after the PREPARE of
xact_009_11. I may be missing something, but how does it change
things? A plain stop() of the primary means that the standby would have
received all the WAL records from the primary on disk in its pg_wal,
no? Upon restart, it should replay everything it finds in pg_wal/. I
don't see a change required here.
For the third wait_for_replay_catchup(), after the PREPARE of
xact_009_12, same dance. The primary is cleanly stopped first. All
the WAL records of the primary should have been flushed to the
standby.
As a whole, it looks like we should just switch the teardown_node() call to
a stop() call in the first test with xact_009_10, backpatch it, and
call it a day. No need for injection points and no need for GUC
tweaks. I have not looked at 004_timeline_switch yet.
> I guess so. cluster::stop does the `pg_ctl stop -m fast` command. In this case
> the walsender waits till there are nothing to be sent, see WalSndLoop().
> Do let me know if you have observed the similar failure here.
Exactly. Doing a clean stop of the primary offers a strong guarantee
here. We are sure that the standby will have received all the records
from the primary. Timeline forking is an impossible thing in
012_subtransactions.pl based on how the switchover from the primary to
the standby happens. I don't see a need for tweaking this test at
all. Or perhaps you did see a failure of some kind in this test,
Alexander?
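
To make the guarantee above concrete, here is a toy model, not PostgreSQL code, of why the stop mode matters. It assumes only what is described in this thread: a fast stop lets the walsender drain everything pending before exiting, while an abrupt stop does not. The class names, the helper and the two "later record" entries are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Standby:
    # WAL records the standby has received so far (toy representation).
    received: list = field(default_factory=list)


@dataclass
class Primary:
    standby: Standby
    wal: list = field(default_factory=list)
    sent_upto: int = 0  # number of WAL records already streamed

    def write(self, record):
        self.wal.append(record)

    def stream(self, max_records):
        """Send up to max_records pending records to the standby."""
        end = min(len(self.wal), self.sent_upto + max_records)
        self.standby.received.extend(self.wal[self.sent_upto:end])
        self.sent_upto = end

    def stop(self, mode="fast"):
        # Toy rule: a "fast" stop drains everything pending first,
        # an "immediate" stop exits without sending anything more.
        if mode == "fast":
            self.stream(len(self.wal))


def missing_on_standby(mode):
    """Return the records the primary wrote but the standby never got."""
    standby = Standby()
    primary = Primary(standby)
    for rec in ["PREPARE xact_009_10", "later record A", "later record B"]:
        primary.write(rec)
    primary.stream(1)  # only the first record made it before shutdown
    primary.stop(mode)
    return primary.wal[len(standby.received):]


print(missing_on_standby("fast"))
print(missing_on_standby("immediate"))
```

In this sketch the fast stop leaves nothing behind on the primary, whereas the immediate stop leaves records that exist only in the old primary's WAL, which is the kind of divergence that could later prevent it from following the promoted standby.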
--
Michael