Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, vignesh C <vignesh21(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "tomas(at)vondra(dot)me" <tomas(at)vondra(dot)me>
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date: 2025-06-20 00:18:20
Message-ID: CAPpHfdvk5RxdKZuFDFgDet6ZAzVW0ojxP-pjjqZPFZUW2N5gEA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 19, 2025 at 1:29 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Wed, Jun 18, 2025 at 10:17 PM Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
> >
> > On Wed, Jun 18, 2025 at 6:50 PM Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> wrote:
> > > > I think, it is a good idea. Once we do not use the generated data, it is ok
> > > > just to generate WAL segments using the proposed function. I've tested this
> > > > function. The tests worked as expected with and without the fix. The attached
> > > > patch does the change.
> > >
> > > Sorry, forgot to attach the patch. It is created on the current master branch.
> > > It may conflict with your corrections. I hope, it could be useful.
> >
> > Thank you. I've integrated this into a patch to improve these tests.
> >
> > Regarding assertion failure, I've found that assert in
> > PhysicalConfirmReceivedLocation() conflicts with restart_lsn
> > previously set by ReplicationSlotReserveWal(). As I can see,
> > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn.
> > So, it doesn't seem there is a guarantee that restart_lsn never goes
> > backward. The comment in ReplicationSlotReserveWal() even states there
> > is a "chance that we have to retry".
> >
>
> I don't see how this theory can lead to a restart_lsn of a slot going
> backwards. The retry mentioned there is just a retry to reserve the
> slot's position again if the required WAL is already removed. Such a
> retry can only get the position later than the previous restart_lsn.

Yes, if a retry is needed, then the new position must be later for sure.
What I mean is that ReplicationSlotReserveWal() can reserve a position
later than what the standby is going to read (and correspondingly report
with PhysicalConfirmReceivedLocation()).
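
To make the ordering concrete, here is a toy model of that scenario. It is
only a sketch and not PostgreSQL code: ToyLsn, ToySlot, toy_reserve_wal()
and toy_confirm_received() are hypothetical stand-ins, and the real
functions do much more (locking, retry, persistence). It assumes the
reservation simply takes the current redo pointer and the confirmation
simply stores whatever position the standby reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN; 0 means "nothing reserved yet". */
typedef uint64_t ToyLsn;

/* Hypothetical, heavily simplified replication slot. */
typedef struct ToySlot
{
    ToyLsn restart_lsn;
} ToySlot;

/* Models the reservation step: take the current redo pointer as-is. */
static void
toy_reserve_wal(ToySlot *slot, ToyLsn redo_ptr)
{
    assert(redo_ptr != 0);
    slot->restart_lsn = redo_ptr;
}

/*
 * Models the standby confirming the position it has received.
 * Returns true when this moved restart_lsn backward, which is the
 * case a "never goes backward" assertion would trip on.
 */
static bool
toy_confirm_received(ToySlot *slot, ToyLsn reported)
{
    bool went_backward = (slot->restart_lsn != 0 &&
                          reported < slot->restart_lsn);

    slot->restart_lsn = reported;
    return went_backward;
}
```

In this model, reserving at 500 and then receiving a report of 300 from
the standby moves restart_lsn backward without any retry being involved.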

> > Thus, I propose to remove the
> > assertion introduced by ca307d5cec90.
> >
>
> If what I said above is correct, then the following part of the commit
> message will be incorrect:
> "As stated in the ReplicationSlotReserveWal() comment, this is not
> always true. Additionally, this issue has been spotted by some
> buildfarm
> members."

I agree, this comment needs improvement in terms of clarity.

Meanwhile, I've pushed the patch for the TAP tests, which I think didn't
get any objections.

------
Regards,
Alexander Korotkov
Supabase
