Re: Clear logical slot's 'synced' flag on promotion of standby

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Ajin Cherian <itsajin(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Clear logical slot's 'synced' flag on promotion of standby
Date: 2025-09-09 08:48:55
Message-ID: CAE9k0P=WXRHXLGxkegFLj9tVLrY45+uTtdgv+Pjt1mqyit4zZw@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Tue, Sep 9, 2025 at 12:53 PM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
>
> On Tue, Sep 9, 2025 at 4:21 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > Hi,
> >
> > This is a spin-off thread from [1].
> >
> > Currently, in the slot-sync worker, we have an error scenario [2]
> > where, during slot synchronization, if we detect a slot with the same
> > name and its synced flag is set to false, we emit an error. The
> > rationale is to avoid potentially overwriting a user-created slot.
> >
> > But while analyzing [1], we observed that this error can lead to
> > inconsistent behavior during switchovers. On the first switchover, the
> > new standby logs an error: "Exiting from slot synchronization because
> > a slot with the same name already exists on the standby." But during
> > a double switchover, this error does not occur.
> >
> > Upon re-evaluating this, it seems more appropriate to clear the synced
> > flag after promotion, as the flag does not hold any meaning on the
> > primary. Doing so would ensure consistent behavior across all
> > switchovers, as the same error will be raised avoiding the risk of
> > overwriting user's slots.
> >
> > A patch can be posted soon on the same idea.
>
> Hi Shveta,
>
> Here’s a patch that addresses this issue. It clears any “synced” flags
> on logical replication slots when a standby is promoted. I’ve also
> added handling for crashes; if the server crashes before the flags are
> cleared, they are reset on restart.
> The restart logic was a bit tricky, since I had to rely on the
> database state to decide when the reset is needed. Documentation on
> these states is sparse, but from my testing I found that
> DB_IN_CRASH_RECOVERY occurs when a standby crashes during promotion.
> That’s the state I use to trigger the flag reset on restart.
>

+ * required resources. Clear any leftover 'synced' flags on replication
+ * slots when in crash recovery on the primary. The DB_IN_CRASH_RECOVERY
+ * state check ensures that this code is only reached when a standby
+ * server crashes during promotion.
  */
  StartupReplicationSlots();
+ if (ControlFile->state == DB_IN_CRASH_RECOVERY)

I believe the primary server can also enter the DB_IN_CRASH_RECOVERY
state. For example, if the primary is already in crash recovery and
crashes again while in crash recovery, it will restart in the
DB_IN_CRASH_RECOVERY state, no?
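
To make the scenario concrete, here is a toy model of it. This is not PostgreSQL code: the enum borrows three state names from pg_control.h, while start_primary() and patch_check_fires() are hypothetical helpers invented for illustration. It also assumes that startup records DB_IN_CRASH_RECOVERY in the control file before WAL replay begins, which is my reading of the startup code and should be double-checked.

```c
#include <stdbool.h>

/* Toy subset of the DBState values; names borrowed from pg_control.h. */
typedef enum
{
    DB_SHUTDOWNED,
    DB_IN_CRASH_RECOVERY,
    DB_IN_PRODUCTION
} DBState;

/* Hypothetical helper: the condition the patch tests after
 * StartupReplicationSlots(), applied to the state left by the last run. */
bool
patch_check_fires(DBState persisted_state)
{
    return persisted_state == DB_IN_CRASH_RECOVERY;
}

/*
 * Hypothetical helper: one startup attempt of a server that has never
 * been a standby. *control_state stands in for the persisted pg_control
 * state. Returns whether the patch's check fired during this startup.
 */
bool
start_primary(DBState *control_state, bool crash_during_recovery)
{
    /* The check sees whatever the previous run left behind. */
    bool fired = patch_check_fires(*control_state);

    if (*control_state != DB_SHUTDOWNED)
    {
        /* Assumption: the state is persisted before replay starts. */
        *control_state = DB_IN_CRASH_RECOVERY;

        if (crash_during_recovery)
            return fired;   /* crashed; state stays DB_IN_CRASH_RECOVERY */
    }

    /* Recovery (if any) finished; the server is up. */
    *control_state = DB_IN_PRODUCTION;
    return fired;
}
```

Starting from DB_SHUTDOWNED, a sequence of clean start, crash in production, restart that crashes again during recovery, then another restart makes the check fire on that last restart, although no standby or promotion was ever involved.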

--

With this change, are we saying that the synced flag must always be
false on the primary? I ask because the PostgreSQL documentation for
pg_replication_slots says:

"The value of this column has no meaning on the primary server; the
column value on the primary is default false for all slots but may (if
leftover from a promoted standby) also be true."

--
With Regards,
Ashutosh Sharma.
