From: vignesh C <vignesh21(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "tomas(at)vondra(dot)me" <tomas(at)vondra(dot)me>
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date: 2025-06-20 12:41:57
Message-ID: CALDaNm34RtDat+kZE_1+CY86pDLJM18hxNRTZx3CCGE=aN=_=A@mail.gmail.com
Lists: pgsql-hackers
On Fri, 20 Jun 2025 at 05:54, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> Dear Kuroda-san,
>
> On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu)
> <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> > > > Regarding assertion failure, I've found that assert in
> > > > PhysicalConfirmReceivedLocation() conflicts with restart_lsn
> > > > previously set by ReplicationSlotReserveWal(). As I can see,
> > > > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn.
> > > > So, it doesn't seem there is a guarantee that restart_lsn never goes
> > > > backward. The comment in ReplicationSlotReserveWal() even states there
> > > > is a "chance that we have to retry".
> > > >
> > >
> > > I don't see how this theory can lead to a restart_lsn of a slot going
> > > backwards. The retry mentioned there is just a retry to reserve the
> > > slot's position again if the required WAL is already removed. Such a
> > > retry can only get the position later than the previous restart_lsn.
> >
> > We analyzed the assertion failure that happened in pg_basebackup/020_pg_receivewal,
> > and confirmed that restart_lsn can go backward. This means that the Assert() added
> > by ca307d5 can cause a crash.
> >
> > Background
> > ===========
> > When pg_receivewal starts replication using a replication slot, it sets the
> > startpoint to the beginning of the segment that contains restart_lsn.
> > E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WAL
> > from 0/B00000.
> > For more detail on this behavior, see f61e1dd2 and d9bae531.
> >
> > What happened here
> > ==================
> > Based on the above, walsender started sending from the beginning of the segment (0/B00000).
> > When walreceiver received it, it tried to send a reply. At that time the flushed
> > location of WAL was 0/B00000. walsender set the received lsn as restart_lsn
> > in PhysicalConfirmReceivedLocation(). Here the restart_lsn went backward (0/B000D0->0/B00000).
> >
> > The assertion failure could happen if a CHECKPOINT happened at that time.
> > The slot's last_saved_restart_lsn was 0/B000D0, but its data.restart_lsn
> > was 0/B00000, which does not satisfy the assertion added in InvalidatePossiblyObsoleteSlot().
>
> Thank you for your detailed explanation!
>
> > Note
> > ====
> > 1.
> > In this case, starting from the beginning of the segment is not a problem, because
> > the checkpoint process only removes WAL files from segments that precede the
> > restart_lsn's wal segment. The current segment (0/B00000) will not be removed,
> > so there is no risk of data loss or inconsistency.
> >
> > 2.
> > A similar pattern applies to pg_basebackup. Both use logic that adjusts the
> > requested streaming position to the start of the segment, and both reply with
> > the received LSN as the flushed location.
> >
> > 3.
> > I considered the theory above, but I could not reproduce the
> > 040_standby_failover_slots_sync failure because it is a timing issue. Has
> > anyone else reproduced it?
> >
> > We are still investigating failure caused at 040_standby_failover_slots_sync.
>
> I didn't manage to reproduce this. But as I can see from the logs [1] on
> mamba, the START_REPLICATION command was issued just before the assert
> trap. Could it be something similar to what I described in [2]?
> Namely:
> 1. ReplicationSlotReserveWal() sets restart_lsn for the slot.
> 2. Concurrent checkpoint flushes that restart_lsn to the disk.
> 3. PhysicalConfirmReceivedLocation() sets restart_lsn of the slot to
> the beginning of the segment.
Here is my analysis of the 040_standby_failover_slots_sync test
failure, in which a physical replication slot's restart_lsn can move backward:
In certain rare cases the restart_lsn can go backwards. This scenario
can be reproduced by running checkpoints continuously and slowing down
WAL apply on the standby. I have attached a patch with changes to reproduce this.
Here's a summary of the sequence of events:
1) Standby confirms a new LSN (0/40369C8) when primary sends some WAL contents:
After the standby writes the received WAL contents in XLogWalRcvWrite(), it
sends this LSN (0/40369C8) as the confirmed flush location. The
primary acknowledges this and updates the replication slot's
restart_lsn accordingly:
2025-06-20 14:33:21.777 IST [134998] standby1 LOG:
PhysicalConfirmReceivedLocation replication slot "sb1_slot" set
restart_lsn to 0/40369C8
2025-06-20 14:33:21.777 IST [134998] standby1 STATEMENT:
START_REPLICATION SLOT "sb1_slot" 0/3000000 TIMELINE 1
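For reference, the update path in PhysicalConfirmReceivedLocation() looks
roughly like the following (a simplified paraphrase from memory, not the
exact upstream code): it overwrites restart_lsn with whatever the standby
reports as flushed, with no check that the new value is ahead of the old one.

    /* Simplified paraphrase of PhysicalConfirmReceivedLocation(). */
    static void
    PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
    {
        ReplicationSlot *slot = MyReplicationSlot;
        bool        changed = false;

        SpinLockAcquire(&slot->mutex);
        if (slot->data.restart_lsn != lsn)  /* "!=", not "<", so it can move backward */
        {
            changed = true;
            slot->data.restart_lsn = lsn;
        }
        SpinLockRelease(&slot->mutex);

        if (changed)
        {
            ReplicationSlotMarkDirty();
            ReplicationSlotsComputeRequiredLSN();
        }
    }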
2) Checkpoint persists the new restart_lsn:
Shortly after receiving the new LSN, a checkpoint occurs, which
saves this restart_lsn to disk:
2025-06-20 14:33:21.780 IST [134913] LOG: checkpoint complete: wrote
0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0
removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync
files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0
kB; lsn=0/4036A20, redo lsn=0/40369C8
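Conceptually, the checkpointer then does the equivalent of the following for
each in-use slot (a paraphrase of what CheckPointReplicationSlots() does, not
the exact source; the field names follow the discussion above), which is where
last_saved_restart_lsn picks up 0/40369C8:

    /* Paraphrase: persist dirty slot state and remember what was saved. */
    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

        if (!s->in_use)
            continue;

        /* Write the slot state, including data.restart_lsn, to disk
         * ("path" stands for the slot's pg_replslot directory) ... */
        SaveSlotToPath(s, path, LOG);

        /* ... and remember which restart_lsn made it to disk. */
        s->last_saved_restart_lsn = s->data.restart_lsn;
    }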
3) Streaming replication is restarted because of a primary_conninfo
change and reload.
The WAL receiver process is restarted:
2025-06-20 14:33:21.779 IST [134997] FATAL: terminating walreceiver
process due to administrator command
4) Standby sends an older flush pointer after restart:
Upon restart, the WAL receiver sends a flush location (0/401D448)
derived from XLogRecoveryCtl->lastReplayedEndRecPtr, which is older
than the previously confirmed restart_lsn. It is important to note that
we are sending lastReplayedEndRecPtr, which is the last successfully
replayed LSN in this case:
2025-06-20 14:33:21.796 IST [135135] LOG: WalReceiverMain
LogstreamResult.Flush initialized to 0/401D448
2025-06-20 14:33:21.796 IST [135135] LOG: sending write 0/401D448
flush 0/401D448 apply 0/401D448
This is done from this code in WalReceiverMain():
...
    /* Initialize LogstreamResult and buffers for processing messages */
    LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);
    initStringInfo(&reply_message);

    /* Initialize nap wakeup times. */
    now = GetCurrentTimestamp();
    for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
        WalRcvComputeNextWakeup(i, now);

    /* Send initial reply/feedback messages. */
    XLogWalRcvSendReply(true, false);
...
In step 1 we sent the LSN of the WAL that had been written; here,
because WAL apply has been slowed down, the replay location lags
behind, and that older replay location is what gets sent.
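For completeness, GetXLogReplayRecPtr() returns the end of the last
successfully replayed record, roughly as follows (paraphrased from memory,
not an exact quote of the source):

    XLogRecPtr
    GetXLogReplayRecPtr(TimeLineID *replayTLI)
    {
        XLogRecPtr  recptr;
        TimeLineID  tli;

        SpinLockAcquire(&XLogRecoveryCtl->info_lck);
        recptr = XLogRecoveryCtl->lastReplayedEndRecPtr;
        tli = XLogRecoveryCtl->lastReplayedTLI;
        SpinLockRelease(&XLogRecoveryCtl->info_lck);

        if (replayTLI)
            *replayTLI = tli;
        return recptr;
    }

So after the walreceiver restart, the initial flush position is the replay
position, which is behind the flush position confirmed before the restart.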
5) I have added logs to detect this inconsistency:
This leads to a scenario where the standby tries to confirm a
restart_lsn older than the one currently held by the primary:
2025-06-20 14:33:21.797 IST [135136] standby1 LOG: crash scenario -
slot sb1_slot, cannot confirm a restart LSN (0/401D448) that is older
than the current one (0/40369C8)
If a checkpoint happens concurrently during this condition, it may
trigger an assertion failure on the primary due to the restart_lsn
being less than the last_saved_restart_lsn.
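Spelling out the invariant that trips (a paraphrase of the assertion
described above, not an exact quote of the source; "s" stands for the slot):

    /*
     * Paraphrased invariant: once a checkpoint has flushed a slot's
     * restart_lsn to disk, the in-memory value is not expected to fall
     * behind the saved copy.
     */
    Assert(XLogRecPtrIsInvalid(s->last_saved_restart_lsn) ||
           s->last_saved_restart_lsn <= s->data.restart_lsn);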
Currently this does not break physical replication, but I'm not sure
how it will behave if the gap grows to span many WAL files and those
WAL files then get removed.
Attached is the patch with which you can reproduce this; grep for
"crash scenario" in the logs. For me it occurs on every run. The
logs from a reproduced run are attached as well.
This proves that restart_lsn can go backward when the standby is
applying WAL slowly. But since this is independent of the change under
discussion in this thread, I feel you can commit the assert removal
patch. I will continue the analysis to see whether there is any
impact, and we can add comments accordingly later.
Regards,
Vignesh
Attachments:
  restart_lsn_backup_repro_v1.patch (application/octet-stream, 27.0 KB)
  040_standby_failover_slots_sync_logs.7z (application/x-compressed, 214.9 KB)