RE: Newly created replication slot may be invalidated by checkpoint

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Vitaly Davydov' <v(dot)davydov(at)postgrespro(dot)ru>
Cc: suyu(dot)cmj <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, aekorotkov <aekorotkov(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, tomas <tomas(at)vondra(dot)me>, michael <michael(at)paquier(dot)xyz>, bharath(dot)rupireddyforpostgres <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Newly created replication slot may be invalidated by checkpoint
Date: 2025-10-07 11:53:34
Message-ID: OSCPR01MB14966C191ECEC11D051148D6DF5E0A@OSCPR01MB14966.jpnprd01.prod.outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Vitaly,

> > Per my understanding, this happened because there is a lag that restart_lsn of
> > the slot is set, and it is protected by the system. Your idea is to ensure the
> > restart_lsn is protected by the system before obtaining on-memory LSN, right?
>
> Not sure what you mean by on-memory LSN, but, the issue happens because we
> have
> a lag between restart_lsn assignment and update of
> XLogCtl->replicationSlotsMinLSN which is used to protect the WAL.

Sorry I should say "before obtaining replicationSlotMinLSN".

> Yes, I
> propose
> to ensure that the protection happens when we assign restart_lsn. It seems to be
> wrong that we invalidate slots by its restart_lsn but protect the wal for
> slots using XLogCtl->replicationSlotsMinLSN.

Seems valid. There is another corner case that another restart_lsn can be set in-between,
but they have larger LSN than RedoRecPtr, right?

> Below I tried to write some summary and propose the patch which fixes the
> problem.

Sorry but it is too long to understand properly for me :-(.

> There is one subtle thing. Once, the operation of restart_lsn assignment is not
> an atomic, the following scenario may happen theoretically:
> 1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
> 2. Assign a new redo LSN in the checkpointer
> 3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
> 3. Assign the old redo LSN to restart_lsn
>
> In this scenario, the restart_lsn will point to a previous redo LSN and it will
> be not protected by the new redo LSN. This scenario is unlikely, but it can
> happen theoretically. I have no ideas how to deal with it, except of assigning
> restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
> XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

Oh, your point is there is another race condition, right? Do you have the reproducer for it?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Viktor Holmberg 2025-10-07 11:56:46 ON CONFLICT DO SELECT (take 3)
Previous Message Hayato Kuroda (Fujitsu) 2025-10-07 11:49:13 RE: [PROPOSAL] Termination of Background Workers for ALTER/DROP DATABASE