Re: Clear logical slot's 'synced' flag on promotion of standby

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Clear logical slot's 'synced' flag on promotion of standby
Date: 2025-10-03 23:03:30
Message-ID: CAD21AoCpj0Sr7hYJXgF2Ata-zfoovO9OBF_QhreYxx23L2S9Ew@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 10, 2025 at 9:00 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Wed, Sep 10, 2025 at 5:23 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Sep 8, 2025 at 11:21 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > >
> > > Hi,
> > >
> > > This is a spin-off thread from [1].
> > >
> > > Currently, in the slot-sync worker, we have an error scenario [2]
> > > where, during slot synchronization, if we detect a slot with the same
> > > name and its synced flag is set to false, we emit an error. The
> > > rationale is to avoid potentially overwriting a user-created slot.
> > >
> > > But while analyzing [1], we observed that this error can lead to
> > > inconsistent behavior during switchovers. On the first switchover, the
> > > new standby logs an error: "Exiting from slot synchronization because
> > > a slot with the same name already exists on the standby." But during
> > > a double switchover, this error does not occur.
> > >
> > > Upon re-evaluating this, it seems more appropriate to clear the synced
> > > flag after promotion, as the flag does not hold any meaning on the
> > > primary. Doing so would ensure consistent behavior across all
> > > switchovers: the same error would be raised each time, avoiding the
> > > risk of overwriting a user's slots.
> >
> > There is the following comment in FinishWalRecovery():
> >
> > /*
> > * Shutdown the slot sync worker to drop any temporary slots acquired by
> > * it and to prevent it from keep trying to fetch the failover slots.
> > *
> > * We do not update the 'synced' column in 'pg_replication_slots' system
> > * view from true to false here, as any failed update could leave 'synced'
> > * column false for some slots. This could cause issues during slot sync
> > * after restarting the server as a standby. While updating the 'synced'
> > * column after switching to the new timeline is an option, it does not
> > * simplify the handling for the 'synced' column. Therefore, we retain the
> > * 'synced' column as true after promotion as it may provide useful
> > * information about the slot origin.
> > */
> > ShutDownSlotSync();
> >
> > Does the patch address the above concerns?
> >
>
> Yes, the patch is attempting to address the above concern. It is
> trying to reset the 'synced' column after switching to a new timeline.
> There is an issue, though, as pointed out by Ashutosh in [1], which
> needs to be addressed.

Nice.

There's an ongoing discussion about a patch that would allow users to
overwrite slot properties [1]. IIUC, the reported inconsistency during
switchover would be resolved by that slot-overwriting patch. I'm
looking into the relationship between the patch discussed in this
thread and the slot-overwriting patch. I'm not yet convinced that the
proposed slot-overwriting patch is the right approach, but suppose we
do allow slot overwriting somehow: what value would the patch proposed
in this thread add? Would its only benefit be ensuring that the
'synced' flag is set to false on the primary?
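
For reference, the behavior under discussion can be sketched as a small
stand-alone model. This is only an illustration under stated
assumptions, not the actual slot-sync code: ModelSlot, SyncAction,
decide_sync_action() and clear_synced_on_promotion() are hypothetical
names, and the real worker does considerably more than a name lookup.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a local replication slot. */
typedef struct ModelSlot
{
    const char *name;
    bool        synced;     /* models the 'synced' flag discussed above */
} ModelSlot;

/* Possible outcomes when the standby tries to sync one remote slot. */
typedef enum SyncAction
{
    SYNC_CREATE,                /* no local slot with that name */
    SYNC_UPDATE,                /* local slot exists and synced = true */
    SYNC_ERROR_NAME_CONFLICT    /* local slot exists and synced = false */
} SyncAction;

/*
 * Model of the check described in the thread: a same-name local slot
 * whose synced flag is false is treated as user-created, so syncing
 * stops with an error instead of overwriting it.
 */
SyncAction
decide_sync_action(const ModelSlot *local, size_t nlocal,
                   const char *remote_name)
{
    for (size_t i = 0; i < nlocal; i++)
    {
        if (strcmp(local[i].name, remote_name) == 0)
            return local[i].synced ? SYNC_UPDATE
                                   : SYNC_ERROR_NAME_CONFLICT;
    }
    return SYNC_CREATE;
}

/*
 * Model of the proposal in this thread: clear the flag on every slot
 * once the standby has been promoted.
 */
void
clear_synced_on_promotion(ModelSlot *slots, size_t nslots)
{
    for (size_t i = 0; i < nslots; i++)
        slots[i].synced = false;
}
```

In this model, a slot that keeps synced = true across promotion yields
SYNC_UPDATE if the node later becomes a standby again (the double
switchover case), whereas after clear_synced_on_promotion() the same
lookup yields SYNC_ERROR_NAME_CONFLICT, matching what happens on the
first switchover.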

Regards,

[1] https://www.postgresql.org/message-id/CAA5-nLAqGpBFEAr2XNYMj3E%2B39caQra_SJeB5MCtp7PCyLTiOg%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
