Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, tomas(at)vondra(dot)me
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date: 2025-05-24 10:37:52
Message-ID: CAPpHfdutKQxpm-gJgiZRb2ouKC9+HZx3fG3F00zd=xdxDidm_g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, May 23, 2025 at 12:10 AM Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
>
> Hi, Vitaly!
>
> On Tue, May 20, 2025 at 6:44 PM Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> wrote:
> >
> > Thank you very much for the review!
> >
> > > The patchset doesn't seem to build after 371f2db8b0, which adjusted
> > > the signature of the INJECTION_POINT() macro. Could you, please,
> > > update the patchset accordingly.
> >
> > I've updated the patch (see attached). Thanks.
> >
> > > I see in 0004 patch we're calling XLogGetReplicationSlotMinimumLSN()
> > > before slots synchronization then use it for WAL truncation.
> > > Generally looks good. But what about the "if
> > > (InvalidateObsoleteReplicationSlots(...))" branch? It calls
> > > XLogGetReplicationSlotMinimumLSN() again. Why would the value
> > > obtained from the latter call reflect slots as they are synchronized
> > > to the disk?
> >
> > In patch 0004 I call XLogGetReplicationSlotMinimumLSN() again to keep the old
> > behaviour - this function was called in KeepLogSeg prior to my change. I also
> > call CheckPointReplicationSlots at the next line to save the invalidated and
> > other dirty slots on disk again to make sure, the new oldest LSN is in sync.
> >
> > The problem I tried to solve in this if-branch is to fix test
> > src/test/recovery/t/019_replslot_limit.pl which was failed because the WAL was
> > not truncated enought for the test to pass ok. In general, this branch is not
> > necessary and we may fix the test by calling checkpoint twice (please, see the
> > alternative.rej patch for this case). If you think, we should incorporate this
> > new change, I'm ok to do it. But the WAL will be truncating more lazily.
> >
> > Furthermore, I think we can save slots on disk right after invalidation, not in
> > CheckPointGuts to avoid saving invalidated slots twice.
>
> Thank you for the clarification. It's all good. I just missed that
> CheckPointReplicationSlots() syncs slots inside the "if" branch.
>
> I've reordered the patchset. Fix should come first, tests comes
> second. So, tests pass after each commit. Also I've joined both
> tests and injection points into single commit. I don't see reason to
> place tests into src/test/modules, because there is no module. I've
> moved them into src/test/recovery.
>
> I also improved some comments and commit messages. I think 0001
> should go to all supported releases as it fixes material bug, while
> 0002 should be backpatched to 17, where injection points fist appears.
> 0003 should go to pg19 after branching. I'm continuing reviewing
> this.

I spend more time on this. The next revision is attached. It
contains revised comments and other cosmetic changes. I'm going to
backpatch 0001 to all supported branches, and 0002 to 17 where
injection points were introduced.

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v5-0003-Remove-redundant-ReplicationSlotsComputeRequiredL.patch application/octet-stream 6.5 KB
v5-0002-Add-TAP-tests-to-check-replication-slot-advance-d.patch application/octet-stream 15.2 KB
v5-0001-Keep-WAL-segments-by-the-flushed-value-of-the-slo.patch application/octet-stream 7.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2025-05-24 11:14:49 Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Previous Message Florents Tselai 2025-05-24 10:34:12 Re: mention unused_oids script in pg_proc.dat