Re: failover logical replication slots

From: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: failover logical replication slots
Date: 2025-06-12 09:02:19
Message-ID: CAA5-nLCojwRhu5Xmv66wNRC+Q_X-_KESyeihbAfQCVj8ZS1U4A@mail.gmail.com
Lists: pgsql-hackers

Thanks for the reply, Amit.

I don't really understand the logic of the implementation. If the slot on
the standby has the same name as the slot on the primary and is marked for
failover, how could it be any different from the primary's slot?
After the first failover, subsequent failovers will work, given that the
synced flag is true on both the primary and standby slots.

After a new standby is attached to the primary, could the sync worker
process, when it starts, check whether a failover slot already exists on
the standby and, if so, drop it before recreating a new one for syncing?
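
To make the proposal concrete, here is a minimal, self-contained sketch of
the decision I have in mind. It is not the actual slotsync code: SlotInfo,
SyncAction and choose_sync_action() are hypothetical names invented for
illustration, and real slots carry many more properties than the two flags
shown here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a slot; NOT PostgreSQL's ReplicationSlot. */
typedef struct SlotInfo
{
    const char *name;
    bool        failover;   /* slot is marked for failover */
    bool        synced;     /* slot was created by slot synchronization */
} SlotInfo;

typedef enum SyncAction
{
    SYNC_CREATE,            /* no local slot with that name: create it */
    SYNC_UPDATE,            /* local slot is already a synced copy */
    SYNC_DROP_AND_RECREATE, /* proposal: replace the stale failover slot */
    SYNC_ERROR              /* same-name slot that is not a failover slot */
} SyncAction;

/*
 * "local" is the standby's slot with the same name as "remote", or NULL
 * when the standby has no such slot.
 */
SyncAction
choose_sync_action(const SlotInfo *local, const SlotInfo *remote)
{
    if (local == NULL)
        return SYNC_CREATE;
    if (local->synced)
        return SYNC_UPDATE;
    /* Old primary rejoining: failover = true but synced = false. */
    if (local->failover && remote->failover)
        return SYNC_DROP_AND_RECREATE;
    return SYNC_ERROR;
}
```

With this logic, the old primary's "sub_test" slot (failover true, synced
false) would be dropped and recreated instead of raising the error, while a
same-name slot without the failover flag would still be rejected.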

Regards,

Fabrice

On Thu, Jun 12, 2025 at 5:14 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Wed, Jun 11, 2025 at 10:17 PM Fabrice Chapuis
> <fabrice636861(at)gmail(dot)com> wrote:
> >
> > Thanks for your reply.
> > The problem I see is that after creating a new subscription, we have:
> >
> > 1) if a failover occurs, on the new primary node, the failover and sync
> > flags are both set to true, so there's no problem.
> >
> > 2) when the old node returns as a secondary in the cluster, the failover
> > flag is set to true and the sync flag is set to false then
> > the error message is generated: ERROR: exiting from slot
> > synchronization because same name slot "sub_test" already exists on the
> > standby
> >
> > Why not change the value of the synced flag when the standby is joining
> > the cluster ? If the slot on the primary node has the same name as the slot
> > on the secondary node and the failover flag is set to true,
> >
> > if ((slot = SearchNamedReplicationSlot(remote_slot->name, true))) {
> >     slot->data.synced = true;
> >     ...
>
> IIUC, Hou-san also mentioned the same idea, but it is not that
> straightforward because the user may have created a logical slot with
> the same name but with a few other different properties like
> two_phase, slot_type, etc. I think we can try to compare all such slot
> properties to ensure that we can overwrite the same name slot, but
> there is still a chance that we may overwrite a slot that the user has
> created for some other purpose. Now, we may want to extend this
> functionality such that we give the user some knob that allows us to
> overwrite existing slots with the same name. Then the user can use this
> knob (GUC or something else) when starting the node as a standby after
> switchover and allow the overwrite for existing slots.
>
> As mentioned by Hou-san and Dilip, I also think it is more important
> for the old node that comes back as a standby to remove logical slots to
> avoid WAL accumulation. For example, we can provide a function like
> pg_drop_all_slots() with a type parameter indicating logical or
> physical, and then utilities like Patroni that provide switchover
> functionality can use that function to remove all existing slots
> (maybe keep the slots that are required for failover) when starting
> the node as a standby.
>
> --
> With Regards,
> Amit Kapila.
>
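
The property check combined with a user-controlled knob, as described in
the quoted reply, could look roughly like the sketch below. Everything in
it is an assumption made for illustration: SlotProps, may_overwrite_slot()
and the allow_overwrite parameter (standing in for a GUC or similar
setting) do not exist in PostgreSQL, and only the properties named in the
thread are compared.

```c
#include <stdbool.h>

/* Hypothetical subset of slot properties mentioned in the thread. */
typedef struct SlotProps
{
    bool is_logical;    /* slot type: logical (true) or physical (false) */
    bool two_phase;     /* two-phase decoding enabled */
    bool failover;      /* slot is marked for failover */
} SlotProps;

/*
 * Return true only when the user has opted in via the (hypothetical) knob
 * AND the existing same-name slot matches the remote slot's properties.
 * Matching properties alone are not treated as sufficient, because the
 * slot may still have been created by the user for another purpose.
 */
bool
may_overwrite_slot(const SlotProps *local, const SlotProps *remote,
                   bool allow_overwrite)
{
    if (!allow_overwrite)
        return false;

    return local->is_logical == remote->is_logical &&
           local->two_phase == remote->two_phase &&
           local->failover == remote->failover;
}
```

Keeping the knob as a separate precondition reflects the concern above:
even a full property match cannot prove that the slot is a leftover from
before the switchover.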
