| From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
|---|---|
| To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
| Cc: | Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "suyu(dot)cmj" <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, tomas <tomas(at)vondra(dot)me>, michael <michael(at)paquier(dot)xyz>, "bharath(dot)rupireddyforpostgres" <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Newly created replication slot may be invalidated by checkpoint |
| Date: | 2025-11-11 03:06:46 |
| Message-ID: | CAA4eK1J0GJgweNgddAOTs6R7DeOkteERsGNPCaa+Fo9dgnvmZw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Mon, Nov 10, 2025 at 2:41 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Nov 5, 2025 at 3:48 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> >
> > On Mon, Oct 6, 2025 at 6:46 PM Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> wrote:
> > > There is one subtle thing. Once, the operation of restart_lsn assignment is not
> > > an atomic, the following scenario may happen theoretically:
> > > 1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
> > > 2. Assign a new redo LSN in the checkpointer
> > > 3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
> > > 3. Assign the old redo LSN to restart_lsn
> > >
> > > In this scenario, the restart_lsn will point to a previous redo LSN and it will
> > > be not protected by the new redo LSN. This scenario is unlikely, but it can
> > > happen theoretically. I have no ideas how to deal with it, except of assigning
> > > restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
> > > XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.
> > >
> > > In case of recovery, when GetXLogReplayRecPtr() is used, the protection by
> > > redo LSN seems to work as well, because a new redo LSN is taken from the latest
> > > replayed checkpoint. Thus, it is guaranteed that GetXLogReplayRecPtr() will not
> > > be less than the new redo LSN, if it is called right after assignment of redo
> > > LSN in CreateRestartPoint().
> >
> > Thank you for highlighting this scenario. I've reviewed it. I think
> > we could avoid it by covering appropriate parts of
> > ReplicationSlotReserveWal() and Create{Check|Restart}Point() by a new
> > LWLock. The draft patch is attached. What do you think?
> >
>
> The fix seems to be only provided for bank branches, but IIUC the
> problem can happen in HEAD as well. In Head, how about acquiring
> ReplicationSlotAllocationLock in Exclusive mode during
> ReplicationSlotReserveWal? This lock is acquired in SHARE mode in
> CheckPointReplicationSlots. So, this should make our calculations
> correct and avoid invalidating the newly created slot.
>
We need to check whether a similar change is required in
reserve_wal_for_local_slot() as well for sync slots.
--
With Regards,
Amit Kapila.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Ajin Cherian | 2025-11-11 03:08:52 | Add support for COPY TO in tablesync for partitioned tables. |
| Previous Message | jian he | 2025-11-11 03:06:11 | Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions |