From: | Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> |
---|---|
To: | shveta malik <shveta(dot)malik(at)gmail(dot)com> |
Cc: | Ajin Cherian <itsajin(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Clear logical slot's 'synced' flag on promotion of standby |
Date: | 2025-09-22 05:31:31 |
Message-ID: | CAE9k0PkZ6CDhFFV3qepOZOpsdC4ujuGsZbDMBr7Go2Ez6-qzFg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Sep 22, 2025 at 8:58 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Fri, Sep 19, 2025 at 7:29 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> >
> > On Fri, Sep 19, 2025 at 3:04 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Sep 18, 2025 at 5:20 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> > > >
> > > > Hi Ajin,
> > > >
> > > > On Thu, Sep 18, 2025 at 4:16 PM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Fri, Sep 12, 2025 at 1:56 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > The approach seems valid and should work, but introducing a new file
> > > > > > like promote.inprogress for this purpose might be excessive. We can
> > > > > > first try analyzing existing information to determine whether we can
> > > > > > distinguish between the two scenarios -- a primary in crash recovery
> > > > > > immediately after a promotion attempt versus a regular primary. If we
> > > > > > are unable to find any way, we can revisit the idea.
> > > > > >
> > > > >
> > > > > I needed a way to reset slots not only during promotion, but also
> > > > > after a crash that occurs while slots are being reset, so there would
> > > > > be a fallback mechanism to clear them again on startup. As Shveta
> > > > > pointed out, it wasn’t trivial to tell apart a standby restarting
> > > > > after crashing during promotion from a primary restarting after a
> > > > > crash. So I decided to just reset slots every time primary (or a
> > > > > standby after promotion) restarts.
> > > > >
> > > > > Because this fallback logic will run on every primary restart, it was
> > > > > important to minimize overhead added by the patch. After some
> > > > > discussion, I placed the reset logic in RestoreSlotFromDisk(), which
> > > > > is invoked by StartupReplicationSlots() whenever the server starts.
> > > > > Because RestoreSlotFromDisk() already loops through all slots, this
> > > > > adds minimum extra work; but also ensures the synced flag is cleared
> > > > > when running on a primary.
> > > > >
> > > > > The next challenge was finding a reliable flag to distinguish
> > > > > primaries from standbys, since we really don’t want to reset the flag
> > > > > on a standby. I tested StandbyMode, RecoveryInProgress(), and
> > > > > InRecovery. But during restarts, both RecoveryInProgress() and
> > > > > InRecovery are always true on both primary and standby. In all my
> > > > > testing, StandbyMode was the only variable that consistently
> > > > > differentiated between the two, which is what I used.
> > > > >
> > > >
> > > > +/*
> > > > + * ResetSyncedSlots()
> > > > + *
> > > > + * Reset all replication slots that have synced=true to synced=false.
> > > > + */
> > > >
> > > > I feel this is not correct, we are force resetting sync flag status
> > > > for all logical slots, not just the one that is set to true.
> > > >
> > > > --
> > > >
> > > > @@ -2664,6 +2715,10 @@ RestoreSlotFromDisk(const char *name)
> > > > memcpy(&slot->data, &cp.slotdata,
> > > > sizeof(ReplicationSlotPersistentData));
> > > >
> > > > + /* reset synced flag if this is a primary server */
> > > > + if (!StandbyMode)
> > > > + slot->data.synced = false;
> > > > +
> > > >
> > > > I think you also need to ensure that you are only doing this for the
> > > > logical slots, it's currently just checking if the slot is in-use or
> > > > not.
> > > >
> > >
> > > I think a better approach would be to reset synced only if it is
> > > marked as synced. Adding a LogicalSlot check wouldn't be incorrect,
> > > but IMO, it may not be necessary.
> > >
> >
> > Thinking further on this, I believe it’s fine even if the slot is
> > forcefully set to false on the primary. The slot type or its sync
> > status doesn’t really matter in this case.
> >
>
> I think we need to mark the slots dirty once we set its synced-flag to
> false. In such a case, we should only mark those which actually needed
> synced flag resetting. IMO, we need a 'synced' flag check here.
>
Fair enough, point taken.
--
With Regards,
Ashutosh Sharma.
From | Date | Subject | |
---|---|---|---|
Next Message | Chao Li | 2025-09-22 05:54:00 | Re: Trivial fix for comment of function table_tuple_lock |
Previous Message | Ashutosh Sharma | 2025-09-22 05:28:54 | Re: How can end users know the cause of LR slot sync delays? |