Re: Standby server with cascade logical replication could not be properly stopped under load

From: Ajin Cherian <itsajin(at)gmail(dot)com>
To: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Standby server with cascade logical replication could not be properly stopped under load
Date: 2025-05-22 09:48:03
Message-ID: CAFPTHDb7CY9X4FEwTY0yTRXE3k_xqGu8Q9dXiDkkAfgEMpUHGg@mail.gmail.com
Lists: pgsql-bugs

On Thu, May 22, 2025 at 7:25 PM Alexey Makhmutov
<a(dot)makhmutov(at)postgrespro(dot)ru> wrote:
>
> Assuming following configuration with three connected servers A->B->C: A
> (primary), B (physical standby) and C (logical replica connected to B).
> If server A is under load and B is applying incoming WAL records while
> also streaming data via logical replication to C, then attempt to stop
> server B in 'fast' mode may be unsuccessful. In this case the server will
> remain in the PM_SHUTDOWN state indefinitely, with all 'walsender' processes
> running in an infinite busy loop (each consuming a full CPU core). To get the
> server out of this state it is necessary either to stop B using
> 'immediate' mode or to stop server C (which causes the 'walsender'
> processes on server B to exit). This issue is reproducible on the latest
> 'master', as well as on the current PG 16/17 branches.
>
> Attached is a test scenario to reproduce the issue: 'test_scenario.zip'.
> This archive contains shell scripts to create the required environment
> (all three servers) and then to execute the steps needed to get the
> server into the incorrect state. First, edit the 'test_env.sh' file and
> specify the path to the PG binaries in the PG_PATH variable and,
> optionally, the set of ports used by the test instances in the 'pg_port'
> array. Then execute the 'test_prepare.sh' script, which will create,
> configure and start all three PG instances. The servers can then be
> started and stopped using the corresponding start and stop scripts. To
> reproduce the issue, ensure that all three servers are running and
> execute the 'test_execute.sh' script. This script starts a 'pgbench'
> instance in the background for 30 seconds to create load on server A,
> waits for 20 seconds and then tries to stop the B instance using the
> default 'fast' mode. The expected behavior is a normal shutdown of B,
> while the observed behavior is different: the shutdown attempt fails and
> each remaining 'walsender' process consumes an entire CPU core. To get
> out of this state one can use the 'stop-C.sh' script to stop server C,
> as this will also complete the shutdown of the B instance.
>
> My understanding is that this issue is caused by the logic in the
> 'GetStandbyFlushRecPtr' function, which returns the current flush point
> for received WAL data. This position is used in 'XLogSendLogical' to
> determine whether the current walsender is in the 'caught up' state
> (i.e. whether we have sent all available data to the downstream
> instance). During shutdown, walsenders are allowed to continue their
> work until they are in the 'caught up' state, while 'postmaster' waits
> for their completion. Currently 'GetStandbyFlushRecPtr' returns the
> position of the last stored record rather than the last applied record.
> This is correct for physical replication, as we can send data to the
> downstream instance without applying it to the local system. For
> logical replication, however, this seems incorrect, as we cannot decode
> data until it has been applied on the current instance. So, if the
> stored WAL position differs from the applied position while the server
> is being stopped, then the
> 'WalSndLoop'/'XLogSendLogical'/'XLogReadRecord' functions will spin in a
> busy loop, waiting for the applied position to advance. The recovery
> process has already stopped at this point, so the loop is infinite.
> Probably either 'GetStandbyFlushRecPtr' or the
> 'WalSndLoop'/'XLogSendLogical' logic needs to be adjusted to account
> for this case with logical replication.
>
> Attached is also a patch which aims to fix this issue:
> 0001-Use-only-replayed-position-as-target-flush-point-for.patch. It
> modifies the behavior of the 'GetStandbyFlushRecPtr' function to ensure
> that it returns only the applied position for logical replication. This
> function can also be invoked from the slot synchronization routines, in
> which case it retains the current behavior of returning the last stored
> position.

Good catch, I agree with the analysis. I've also verified that the fix
works as expected.
Just a small comment: could you explicitly mention in the comments that
"for logical replication we can only send records that have already
been replayed, else we might get stuck in shutdown", or something to
that effect? That distinction is important for future developers in
this area.

regards,
Ajin Cherian
Fujitsu Australia
