Re: Standby server with cascade logical replication could not be properly stopped under load

From: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Standby server with cascade logical replication could not be properly stopped under load
Date: 2025-05-28 16:12:25
Message-ID: a4f477e0-5d17-466b-a8cc-e0d905d1161b@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 5/28/25 10:35, Michael Paquier wrote:
> Wouldn't it be better to extend GetStandbyFlushRecPtr() so as the
> caller can directly decide if they want to include in the maths the
> last LSN flushed by the WAL receiver, depending on the code path where
> the routine is called? That makes for a larger patch, of course, but
> it removes the need to look at MyWalSnd inside
> GetStandbyFlushRecPtr().

Well, this is a great question, because it made me thinking whether we
need to change behavior of THIS function at all. Extending
'GetStandbyFlushRecPtr' to properly support logical replication will
just make it equal to the 'GetXLogReplayRecPtr'. So, basically we just
need to replace invocation of 'GetStandbyFlushRecPtr' with
'GetXLogReplayRecPtr' and this should solve the problem.

The only invoker of 'GetStandbyFlushRecPtr' outside of walsender.c is
slot synchronization routine, which expects to get 'physical' position
anyway, so it does not need to be changed. All other invocations are
performed from the 'walsender.c' file itself. Obviously, both
'StartReplication' (physical) and 'XLogSendPhysical' functions need
current logic, so we can skip them. The 'XLogSendLogical' need to be
changed to use 'GetXLogReplayRecPtr' (to get consistent behavior with
WalSndWaitForWal and actually fix the original problem discussed in this
thread). This leaves us with only single function to consider:
'IdentifySystem'. This function could be called both in case of physical
and logical replication, but I assume that we need to return flushed WAL
position in it as well, as it's described in the documentation for this
command. This also seems to be the right behavior, as primary goal for
'identify' command is to inform client about the potential replication
start point and we definitely can start replication from the flushed
position (as we already has this record), just with some delay (once our
recovery get us to this point). But maybe somebody with more knowledge
of the logical replication expectation could confirm this.

The fix seems to be very simple in this case (assuming the logic above
is correct): we only replace 'GetStandbyFlushRecPtr' call in
'XLogSendLogical' with the 'GetXLogReplayRecPtr' and that's it. Attached
is the modified patch. It could be applied on top of 16, 17 and master
versions.

As for the test: do you want this case with shutdown to be added to the
existing test plan for standby logical recovery? The condition to
trigger the problem is just to have a flushed and replayed positions
pointing to different values during standby shutdown. I could try to add
it to the test suit, but it could take some time.

Thanks,
Alexey

Attachment Content-Type Size
0001-Use-only-replayed-position-as-target-flush-point-4-pg-master.patch text/x-patch 2.3 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Masahiko Sawada 2025-05-28 17:15:46 Re: BUG #18938: Logical replication failure in 16.9: "invalid memory alloc request size 1372786672"
Previous Message PG Bug reporting form 2025-05-28 14:54:37 BUG #18939: GIN index not used when it's expected to be