From: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
---|---|
To: | 'Vitaly Davydov' <v(dot)davydov(at)postgrespro(dot)ru> |
Cc: | suyu(dot)cmj <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, aekorotkov <aekorotkov(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, tomas <tomas(at)vondra(dot)me>, michael <michael(at)paquier(dot)xyz>, bharath(dot)rupireddyforpostgres <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | RE: Newly created replication slot may be invalidated by checkpoint |
Date: | 2025-10-07 11:53:34 |
Message-ID: | OSCPR01MB14966C191ECEC11D051148D6DF5E0A@OSCPR01MB14966.jpnprd01.prod.outlook.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Dear Vitaly,
> > Per my understanding, this happened because there is a lag that restart_lsn of
> > the slot is set, and it is protected by the system. Your idea is to ensure the
> > restart_lsn is protected by the system before obtaining on-memory LSN, right?
>
> Not sure what you mean by on-memory LSN, but, the issue happens because we
> have
> a lag between restart_lsn assignment and update of
> XLogCtl->replicationSlotsMinLSN which is used to protect the WAL.
Sorry I should say "before obtaining replicationSlotMinLSN".
> Yes, I
> propose
> to ensure that the protection happens when we assign restart_lsn. It seems to be
> wrong that we invalidate slots by its restart_lsn but protect the wal for
> slots using XLogCtl->replicationSlotsMinLSN.
Seems valid. There is another corner case that another restart_lsn can be set in-between,
but they have larger LSN than RedoRecPtr, right?
> Below I tried to write some summary and propose the patch which fixes the
> problem.
Sorry but it is too long to understand properly for me :-(.
> There is one subtle thing. Once, the operation of restart_lsn assignment is not
> an atomic, the following scenario may happen theoretically:
> 1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
> 2. Assign a new redo LSN in the checkpointer
> 3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
> 3. Assign the old redo LSN to restart_lsn
>
> In this scenario, the restart_lsn will point to a previous redo LSN and it will
> be not protected by the new redo LSN. This scenario is unlikely, but it can
> happen theoretically. I have no ideas how to deal with it, except of assigning
> restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
> XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.
Oh, your point is there is another race condition, right? Do you have the reproducer for it?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
From | Date | Subject | |
---|---|---|---|
Next Message | Viktor Holmberg | 2025-10-07 11:56:46 | ON CONFLICT DO SELECT (take 3) |
Previous Message | Hayato Kuroda (Fujitsu) | 2025-10-07 11:49:13 | RE: [PROPOSAL] Termination of Background Workers for ALTER/DROP DATABASE |