From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Standby server with cascade logical replication could not be properly stopped under load
Date: 2025-05-22 10:33:14
Message-ID: CAJpy0uA9kw9WLWETPg+upwzBu145vMB7sPZwBuR_vDrbazeuag@mail.gmail.com
Lists: pgsql-bugs
On Thu, May 22, 2025 at 7:51 AM Alexey Makhmutov
<a(dot)makhmutov(at)postgrespro(dot)ru> wrote:
>
> Assume the following configuration with three connected servers A->B->C: A
> (primary), B (physical standby) and C (logical replica connected to B).
> If server A is under load and B is applying incoming WAL records while
> also streaming data via logical replication to C, then an attempt to stop
> server B in 'fast' mode may be unsuccessful. In this case the server
> remains in the PM_SHUTDOWN state indefinitely, with all 'walsender'
> processes running in an infinite busy loop (consuming a CPU core each).
> To get the server out of this state, one has to either stop B using
> 'immediate' mode or stop server C (which causes the 'walsender'
> processes on server B to exit). This issue is reproducible on the latest
> 'master', as well as on the current PG 16/17 branches.
>
> Attached is a test scenario to reproduce the issue: 'test_scenario.zip'.
> This archive contains shell scripts to create the required environment
> (all three servers) and then to execute the steps required to get the
> server into the incorrect state. First, edit the 'test_env.sh' file and
> specify the path to the PG binaries in the PG_PATH variable and,
> optionally, the set of ports used by the test instances in the 'pg_port'
> array. Then execute the 'test_prepare.sh' script, which creates,
> configures and starts all three PG instances. The servers can then be
> started and stopped using the corresponding start and stop scripts. To
> reproduce the issue, ensure that all three servers are running and
> execute the 'test_execute.sh' script. This script starts a 'pgbench'
> instance in the background for 30 seconds to create load on server A,
> waits for 20 seconds and then tries to stop the B instance using the
> default 'fast' mode. The expected behavior is a normal shutdown of B,
> but the observed behavior is different: the shutdown attempt fails and
> each remaining 'walsender' process consumes an entire CPU core. To get
> out of this state, one can use the 'stop-C.sh' script to stop server C,
> which also allows the B instance to complete its shutdown.
>
> My understanding is that this issue is caused by the logic in the
> 'GetStandbyFlushRecPtr' function, which returns the current flush point
> for received WAL data. This position is used in 'XLogSendLogical' to
> determine whether the current walsender is in the 'caught up' state
> (i.e. whether we have sent all available data to the downstream
> instance). During shutdown, walsenders are allowed to continue their
> work until they reach the 'caught up' state, while the 'postmaster'
> waits for their completion. Currently 'GetStandbyFlushRecPtr' returns
> the position of the last stored record, rather than the last applied
> record. This is correct for physical replication, as we can send data
> to the downstream instance without applying it to the local system.
> However, for logical replication this is incorrect, as we cannot decode
> data until it has been applied on the current instance. So, if the
> stored WAL position differs from the applied position while the server
> is being stopped, then the 'WalSndLoop'/'XLogSendLogical'/
> 'XLogReadRecord' functions spin in a busy loop, waiting for the applied
> position to advance. The recovery process is already stopped at this
> moment, so this becomes an infinite loop. Probably either the
> 'GetStandbyFlushRecPtr' or the 'WalSndLoop'/'XLogSendLogical' logic
> needs to be adjusted to take this logical replication case into account.
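>
> Roughly, the interplay looks like this (a paraphrased sketch based on
> my reading of walsender.c; names and details here are approximate, not
> the exact upstream code):
>
>     /* Paraphrased sketch, not the actual PostgreSQL source. */
>     #include "access/xlogrecovery.h"      /* GetXLogReplayRecPtr() */
>     #include "replication/walreceiver.h"  /* GetWalRcvFlushRecPtr() */
>
>     static XLogRecPtr
>     GetStandbyFlushRecPtr(TimeLineID *tli)
>     {
>         XLogRecPtr  receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
>         XLogRecPtr  replayPtr  = GetXLogReplayRecPtr(tli);
>
>         /* Advertises the *stored* position when it is ahead of replay. */
>         return (receivePtr > replayPtr) ? receivePtr : replayPtr;
>     }
>
>     /* In XLogSendLogical(), roughly: */
>     flushPtr = am_cascading_walsender ? GetStandbyFlushRecPtr(NULL)
>                                       : GetFlushRecPtr(NULL);
>     if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
>         WalSndCaughtUp = true;  /* never reached if flushPtr lies beyond
>                                  * what recovery will ever replay */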
>
> Also attached is a patch which aims to fix this issue:
> 0001-Use-only-replayed-position-as-target-flush-point-for.patch. It
> modifies the behavior of the 'GetStandbyFlushRecPtr' function to ensure
> that it returns only the applied position for logical replication. This
> function can also be invoked from the slot synchronization routines,
> and in that case it retains the current behavior of returning the last
> stored position.
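>
> The intended behavior could be expressed roughly as below. This is only
> an illustrative sketch of the idea, not the patch text itself; the
> 'use_replay_only' flag is hypothetical and the patch may distinguish
> the callers differently:
>
>     /* Illustrative sketch of the idea, not the attached patch. */
>     static XLogRecPtr
>     GetStandbyFlushRecPtr(TimeLineID *tli, bool use_replay_only)
>     {
>         XLogRecPtr  receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
>         XLogRecPtr  replayPtr  = GetXLogReplayRecPtr(tli);
>
>         /*
>          * A logical walsender cannot decode anything beyond the replayed
>          * position, so advertise only replayPtr to it.  Slot
>          * synchronization keeps the current behavior and may also see
>          * the stored (received) position.
>          */
>         if (use_replay_only)
>             return replayPtr;
>
>         return (receivePtr > replayPtr) ? receivePtr : replayPtr;
>     }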
>
The problem stated for the logical walsender on a physical standby looks
genuine. I agree with the analysis for slot-sync as well: slot-sync does
not need the fix, as it deals only with the flush position and does not
care about the replay position. Since the problem area falls under
'Allow logical decoding on standbys', I am adding Bertrand for further
comments on this fix.
thanks
Shveta