RE: Newly created replication slot may be invalidated by checkpoint

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Vitaly Davydov' <v(dot)davydov(at)postgrespro(dot)ru>
Cc: suyu(dot)cmj <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, aekorotkov <aekorotkov(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, tomas <tomas(at)vondra(dot)me>, michael <michael(at)paquier(dot)xyz>, bharath(dot)rupireddyforpostgres <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Newly created replication slot may be invalidated by checkpoint
Date: 2025-10-07 11:53:34
Message-ID: OSCPR01MB14966C191ECEC11D051148D6DF5E0A@OSCPR01MB14966.jpnprd01.prod.outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Vitaly,

> > Per my understanding, this happened because there is a lag that restart_lsn of
> > the slot is set, and it is protected by the system. Your idea is to ensure the
> > restart_lsn is protected by the system before obtaining on-memory LSN, right?
>
> Not sure what you mean by on-memory LSN, but, the issue happens because we
> have
> a lag between restart_lsn assignment and update of
> XLogCtl->replicationSlotsMinLSN which is used to protect the WAL.

Sorry I should say "before obtaining replicationSlotMinLSN".

> Yes, I
> propose
> to ensure that the protection happens when we assign restart_lsn. It seems to be
> wrong that we invalidate slots by its restart_lsn but protect the wal for
> slots using XLogCtl->replicationSlotsMinLSN.

Seems valid. There is another corner case that another restart_lsn can be set in-between,
but they have larger LSN than RedoRecPtr, right?

> Below I tried to write some summary and propose the patch which fixes the
> problem.

Sorry but it is too long to understand properly for me :-(.

> There is one subtle thing. Once, the operation of restart_lsn assignment is not
> an atomic, the following scenario may happen theoretically:
> 1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
> 2. Assign a new redo LSN in the checkpointer
> 3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
> 3. Assign the old redo LSN to restart_lsn
>
> In this scenario, the restart_lsn will point to a previous redo LSN and it will
> be not protected by the new redo LSN. This scenario is unlikely, but it can
> happen theoretically. I have no ideas how to deal with it, except of assigning
> restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
> XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

Oh, your point is there is another race condition, right? Do you have the reproducer for it?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Viktor Holmberg 2025-10-07 11:56:46 ON CONFLICT DO SELECT (take 3)
Previous Message Hayato Kuroda (Fujitsu) 2025-10-07 11:49:13 RE: [PROPOSAL] Termination of Background Workers for ALTER/DROP DATABASE