From: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Standby server with cascade logical replication could not be properly stopped under load
Date: 2025-05-22 02:19:56
Message-ID: 52138028-7246-421c-9161-4fa108b88070@postgrespro.ru
Lists: pgsql-bugs
Assume the following configuration with three connected servers A->B->C: A
(primary), B (physical standby) and C (logical replica connected to B).
If server A is under load and B is applying incoming WAL records while
also streaming data via logical replication to C, then an attempt to stop
server B in 'fast' mode may be unsuccessful. In this case the server
remains in the PM_SHUTDOWN state indefinitely, with all 'walsender'
processes running in an infinite busy loop (each consuming a full CPU
core). To get the server out of this state it is necessary either to stop
B using 'immediate' mode or to stop server C (which causes the
'walsender' processes on server B to exit). This issue is reproducible on
the latest 'master', as well as on the current PG 16/17 branches.
Attached is a test scenario to reproduce the issue: 'test_scenario.zip'.
This archive contains shell scripts to create the required environment
(all three servers) and then execute the steps needed to get the server
into the incorrect state. First, edit the 'test_env.sh' file: specify the
path to the PG binaries in the PG_PATH variable and, optionally, the set
of ports used by the test instances in the 'pg_port' array. Then execute
the 'test_prepare.sh' script, which creates, configures and starts all
three PG instances. The servers can then be started and stopped using the
corresponding start and stop scripts. To reproduce the issue, ensure that
all three servers are running and execute the 'test_execute.sh' script.
This script starts a 'pgbench' instance in the background for 30 seconds
to create load on server A, waits for 20 seconds and then tries to stop
the B instance using the default 'fast' mode. The expected behavior is a
normal shutdown of B; the observed behavior is different: the shutdown
attempt fails and each remaining 'walsender' process consumes an entire
CPU core. To get out of this state one can use the 'stop-C.sh' script to
stop server C, which also completes the shutdown of the B instance.
My understanding is that this issue is caused by the logic in the
'GetStandbyFlushRecPtr' function, which returns the current flush point
for received WAL data. This position is used in 'XLogSendLogical' to
determine whether the current walsender is in the 'caught up' state (i.e.
whether we have sent all available data to the downstream instance).
During shutdown, walsenders are allowed to continue their work until they
are caught up, while the 'postmaster' waits for their completion. A
simplified sketch of this check is shown below.
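To illustrate, here is the relevant part of XLogSendLogical, paraphrased
from src/backend/replication/walsender.c (details vary slightly between
branches; error handling and lag tracking are omitted):

    XLogRecord *record;
    char       *errm;
    static XLogRecPtr flushPtr = InvalidXLogRecPtr;

    WalSndCaughtUp = false;

    record = XLogReadRecord(logical_decoding_ctx->reader, &errm);
    if (record != NULL)
    {
        LogicalDecodingProcessRecord(logical_decoding_ctx,
                                     logical_decoding_ctx->reader);
        sentPtr = logical_decoding_ctx->reader->EndRecPtr;
    }

    /* Refresh the cached flush position once decoding appears to pass it. */
    if (flushPtr == InvalidXLogRecPtr ||
        logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
    {
        if (am_cascading_walsender)
            flushPtr = GetStandbyFlushRecPtr(NULL);  /* on a standby */
        else
            flushPtr = GetFlushRecPtr(NULL);         /* on a primary */
    }

    /* We are 'caught up' only once decoding reaches the flush point. */
    if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
        WalSndCaughtUp = true;

If flushPtr points beyond the last applied record, the reader can never
reach it once recovery has stopped, so WalSndCaughtUp never becomes true.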
Currently 'GetStandbyFlushRecPtr' returns the position of the last stored
record, rather than the last applied record. This is correct for physical
replication, as we can send data to a downstream instance without
applying it on the local system. However, for logical replication this
seems to be incorrect, as we cannot decode data until it has been applied
on the current instance. So, if the stored WAL position differs from the
applied position while the server is being stopped, the
'WalSndLoop'/'XLogSendLogical'/'XLogReadRecord' functions will spin in a
busy loop, waiting for the applied position to advance. The recovery
process is already stopped at this point, so this becomes an infinite
loop. Probably either 'GetStandbyFlushRecPtr' or the
'WalSndLoop'/'XLogSendLogical' logic needs to be adjusted to take this
logical replication case into consideration.
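For reference, the current logic of GetStandbyFlushRecPtr looks roughly
like this (paraphrased from src/backend/access/transam/xlog.c):

    receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
    replayPtr = GetXLogReplayRecPtr(&replayTLI);

    if (tli)
        *tli = replayTLI;

    /* Prefer the received (stored) position when it is ahead of replay. */
    result = replayPtr;
    if (receiveTLI == replayTLI && receivePtr > replayPtr)
        result = receivePtr;

    return result;

It is the 'receivePtr > replayPtr' branch that reports a position which a
logical walsender cannot reach after recovery has stopped.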
Attached is also a patch which aims to fix this issue:
0001-Use-only-replayed-position-as-target-flush-point-for.patch. It
modifies the behavior of the 'GetStandbyFlushRecPtr' function to ensure
that it returns only the applied position for logical replication. This
function can also be invoked from the slot synchronization routines, and
in that case it retains the current behavior of returning the last stored
position.
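For readers who do not want to open the attachment, the shape of such a
change might look roughly like the following. This is a hypothetical
sketch only, assuming logical walsenders can be distinguished by their
slot type; the attached patch is authoritative:

    result = replayPtr;
    /* Only physical sending (and slot synchronization) may use the
     * received-but-not-yet-applied position; logical decoding must
     * wait for replay to catch up. */
    if (!(am_cascading_walsender && MyReplicationSlot != NULL &&
          SlotIsLogical(MyReplicationSlot)) &&
        receiveTLI == replayTLI && receivePtr > replayPtr)
        result = receivePtr;

    return result;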
Thanks,
Alexey
Attachments:
test_scenario.zip (application/zip, 5.1 KB)
0001-Use-only-replayed-position-as-target-flush-point-for.patch (text/x-patch, 2.9 KB)