Re: Clear logical slot's 'synced' flag on promotion of standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: Ajin Cherian <itsajin(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Clear logical slot's 'synced' flag on promotion of standby
Date: 2025-09-12 03:56:41
Message-ID: CAJpy0uA111v1-3Lmo-J+QsCSLFMOYnJpOestsoH4CQHgyP4OMA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 11, 2025 at 7:29 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> On Thu, Sep 11, 2025 at 9:17 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Tue, Sep 9, 2025 at 2:19 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> > >
> > > Hi,
> > >
> > >
> > > + * required resources. Clear any leftover 'synced' flags on replication
> > > + * slots when in crash recovery on the primary. The DB_IN_CRASH_RECOVERY
> > > + * state check ensures that this code is only reached when a standby
> > > + * server crashes during promotion.
> > > */
> > > StartupReplicationSlots();
> > > + if (ControlFile->state == DB_IN_CRASH_RECOVERY)
> > >
> > > I believe the primary server can also enter the DB_IN_CRASH_RECOVERY
> > > state. For example, if the primary is already in crash recovery and
> > > crashes again while in crash recovery, it will restart in the
> > > DB_IN_CRASH_RECOVERY state, no?
> > >
> >
> > Yes, good point. I think we can differentiate the two cases based on
> > the timeline change. A regular primary won't have a timeline change,
> > whereas a promoted standby that failed during promotion will show a
> > timeline change immediately upon restart. Thoughts?
> >
>
> Will there be any issues if we clear the sync status immediately after
> the standby.signal file is removed from the standby server?
>
> We could maybe introduce a temporary "promote.inprogress" marker file
> on disk before removing standby.signal. The sequence would be:
>
> 1) Create promote.inprogress.
> 2) Unlink standby.signal
> 3) Clear the sync slot status.
> 4) Remove promote.inprogress.
>
> This way, if the server crashes after standby.signal is removed but
> before the sync status is cleared, the presence of promote.inprogress
> would indicate that the standby was in the middle of promotion and
> crashed before slot cleanup. On restart, we could use that marker to
> detect the incomplete promotion and finish clearing the sync flags.
>
> If the crash happens at a later stage, the server will no longer start
> as a standby anyway, and by then the sync flags would already have
> been reset.
>
> This is just a thought and it may sound a bit naive. Let me know if I
> am overlooking something.
>

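For reference, the four-step sequence quoted above could be sketched
roughly as below. This is only a toy model in plain C, not a patch: the
marker name promote.inprogress comes from the proposal, but DemoSlot and
every demo_* function are hypothetical, the 'synced' flags live in an
in-memory array that stands in for persisted slot state, and the fsync
calls needed to make each step durable are omitted. It also assumes that
a marker found alongside standby.signal means the promotion never really
started, so the marker is simply discarded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define STANDBY_SIGNAL "standby.signal"
#define PROMOTE_MARKER "promote.inprogress"   /* name from the proposal */

/* Hypothetical stand-in for a replication slot's persisted state. */
typedef struct DemoSlot { bool synced; } DemoSlot;

bool demo_file_exists(const char *dir, const char *name)
{
    char path[1024];
    snprintf(path, sizeof(path), "%s/%s", dir, name);
    return access(path, F_OK) == 0;
}

bool demo_create_file(const char *dir, const char *name)
{
    char path[1024];
    FILE *f;
    snprintf(path, sizeof(path), "%s/%s", dir, name);
    f = fopen(path, "w");
    if (f == NULL)
        return false;
    fclose(f);
    return true;
}

void demo_remove_file(const char *dir, const char *name)
{
    char path[1024];
    snprintf(path, sizeof(path), "%s/%s", dir, name);
    unlink(path);               /* a missing file is not an error here */
}

void demo_clear_synced(DemoSlot *slots, int nslots)
{
    for (int i = 0; i < nslots; i++)
        slots[i].synced = false;
}

/*
 * Run the promotion steps in order. steps_done is the number of steps
 * completed before a simulated crash; 4 means the sequence finishes.
 */
void demo_promote(const char *dir, DemoSlot *slots, int nslots, int steps_done)
{
    assert(dir != NULL);
    if (steps_done < 1) return;
    demo_create_file(dir, PROMOTE_MARKER);    /* 1) create marker */
    if (steps_done < 2) return;
    demo_remove_file(dir, STANDBY_SIGNAL);    /* 2) unlink standby.signal */
    if (steps_done < 3) return;
    demo_clear_synced(slots, nslots);         /* 3) clear sync status */
    if (steps_done < 4) return;
    demo_remove_file(dir, PROMOTE_MARKER);    /* 4) remove marker */
}

/*
 * Restart-time check. Returns true if an interrupted promotion was
 * detected and the slot cleanup was completed here.
 */
bool demo_recover_after_restart(const char *dir, DemoSlot *slots, int nslots)
{
    if (!demo_file_exists(dir, PROMOTE_MARKER))
        return false;           /* no promotion was in flight */
    if (demo_file_exists(dir, STANDBY_SIGNAL))
    {
        /* Assumption: crashed before step 2, so we are still a standby. */
        demo_remove_file(dir, PROMOTE_MARKER);
        return false;
    }
    demo_clear_synced(slots, nslots);
    demo_remove_file(dir, PROMOTE_MARKER);
    return true;
}
```

Calling demo_promote() with steps_done = 2 models the problematic window
(standby.signal gone, flags not yet cleared); a following call to
demo_recover_after_restart() then finishes the cleanup.
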
The approach seems valid and should work, but introducing a new file
such as promote.inprogress just for this purpose might be excessive.
Let's first analyze the information we already have and see whether it
lets us distinguish the two scenarios: a primary that is in crash
recovery immediately after a promotion attempt versus a regular primary
in crash recovery. If we can't find a way, we can revisit this idea.
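
As one possible shape for such a check, here is a minimal sketch of the
timeline-based idea mentioned earlier in the thread: treat a server in
crash recovery as "crashed during promotion" only if its timeline has
changed. Everything here is hypothetical (DemoDBState, DemoStartupInfo
and demo_crashed_during_promotion are illustration-only names), and
where the two timeline IDs would actually be read from is exactly the
part that still needs analysis.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the relevant control-file states. */
typedef enum DemoDBState
{
    DEMO_DB_IN_PRODUCTION,
    DEMO_DB_IN_ARCHIVE_RECOVERY,
    DEMO_DB_IN_CRASH_RECOVERY
} DemoDBState;

/* Hypothetical snapshot of what is known at startup. */
typedef struct DemoStartupInfo
{
    DemoDBState state;
    uint32_t    checkpoint_tli;  /* timeline as of the last checkpoint */
    uint32_t    current_tli;     /* timeline the server restarts on */
} DemoStartupInfo;

/*
 * A regular primary in crash recovery stays on the same timeline; a
 * standby that crashed mid-promotion shows a timeline change on restart.
 */
bool demo_crashed_during_promotion(const DemoStartupInfo *info)
{
    assert(info != NULL);
    return info->state == DEMO_DB_IN_CRASH_RECOVERY &&
           info->current_tli != info->checkpoint_tli;
}
```

With this, a primary that crashes repeatedly during crash recovery is
not mistaken for a half-promoted standby, since its timeline never
changes.
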

thanks
Shveta
