From: | Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: Standby server with cascade logical replication could not be properly stopped under load |
Date: | 2025-05-28 16:12:25 |
Message-ID: | a4f477e0-5d17-466b-a8cc-e0d905d1161b@postgrespro.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
On 5/28/25 10:35, Michael Paquier wrote:
> Wouldn't it be better to extend GetStandbyFlushRecPtr() so as the
> caller can directly decide if they want to include in the maths the
> last LSN flushed by the WAL receiver, depending on the code path where
> the routine is called? That makes for a larger patch, of course, but
> it removes the need to look at MyWalSnd inside
> GetStandbyFlushRecPtr().
Well, this is a great question, because it made me thinking whether we
need to change behavior of THIS function at all. Extending
'GetStandbyFlushRecPtr' to properly support logical replication will
just make it equal to the 'GetXLogReplayRecPtr'. So, basically we just
need to replace invocation of 'GetStandbyFlushRecPtr' with
'GetXLogReplayRecPtr' and this should solve the problem.
The only invoker of 'GetStandbyFlushRecPtr' outside of walsender.c is
slot synchronization routine, which expects to get 'physical' position
anyway, so it does not need to be changed. All other invocations are
performed from the 'walsender.c' file itself. Obviously, both
'StartReplication' (physical) and 'XLogSendPhysical' functions need
current logic, so we can skip them. The 'XLogSendLogical' need to be
changed to use 'GetXLogReplayRecPtr' (to get consistent behavior with
WalSndWaitForWal and actually fix the original problem discussed in this
thread). This leaves us with only single function to consider:
'IdentifySystem'. This function could be called both in case of physical
and logical replication, but I assume that we need to return flushed WAL
position in it as well, as it's described in the documentation for this
command. This also seems to be the right behavior, as primary goal for
'identify' command is to inform client about the potential replication
start point and we definitely can start replication from the flushed
position (as we already has this record), just with some delay (once our
recovery get us to this point). But maybe somebody with more knowledge
of the logical replication expectation could confirm this.
The fix seems to be very simple in this case (assuming the logic above
is correct): we only replace 'GetStandbyFlushRecPtr' call in
'XLogSendLogical' with the 'GetXLogReplayRecPtr' and that's it. Attached
is the modified patch. It could be applied on top of 16, 17 and master
versions.
As for the test: do you want this case with shutdown to be added to the
existing test plan for standby logical recovery? The condition to
trigger the problem is just to have a flushed and replayed positions
pointing to different values during standby shutdown. I could try to add
it to the test suit, but it could take some time.
Thanks,
Alexey
Attachment | Content-Type | Size |
---|---|---|
0001-Use-only-replayed-position-as-target-flush-point-4-pg-master.patch | text/x-patch | 2.3 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Masahiko Sawada | 2025-05-28 17:15:46 | Re: BUG #18938: Logical replication failure in 16.9: "invalid memory alloc request size 1372786672" |
Previous Message | PG Bug reporting form | 2025-05-28 14:54:37 | BUG #18939: GIN index not used when it's expected to be |