Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Date: 2025-11-14 03:52:54
Message-ID: CAJpy0uCbjtsoZJe2NyPnO+_vj0FPiE7NDj1U=w+5M7xcr3UN3w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 12, 2025 at 10:42 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Wed, Nov 12, 2025 at 3:42 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Wed, Nov 12, 2025 at 3:36 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Nov 11, 2025 at 6:05 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > On Mon, Nov 10, 2025 at 8:05 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Thu, Nov 6, 2025 at 4:32 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > > >
> > > > > >
> > > > > > I've updated and rebased the patch.
> > > > > >
> > > > >
> > > > > Thanks for the patch. Please find a few comments:
> > > > >
> > > > >
> > > > > 1)
> > > > > ReplicationSlotsDropDBSlots:
> > > > >
> > > > > + SpinLockAcquire(&s->mutex);
> > > > > + invalidated = s->data.invalidated == RS_INVAL_NONE;
> > > > > + SpinLockRelease(&s->mutex);
> > > > > +
> > > > > + /*
> > > > > + * Count slots on other databases too so we can disable logical
> > > > > + * decoding only if no slots in the cluster.
> > > > > + */
> > > > > + if (invalidated)
> > > > > + n_valid_logicalslots++;
> > > > >
> > > > >
> > > > > This seems confusing to me. Can we instead do:
> > > > >
> > > > > SpinLockAcquire(&s->mutex);
> > > > > if (s->data.invalidated == RS_INVAL_NONE)
> > > > > n_valid_logicalslots++;
> > > > > SpinLockRelease(&s->mutex);
> > > > >
> > > > > 2)
> > > > > InvalidateObsoleteReplicationSlots:
> > > > >
> > > > > + bool islogical = SlotIsLogical(s);
> > > > >
> > > > > /* Prevent invalidation of logical slots during binary upgrade */
> > > > > if (SlotIsLogical(s) && IsBinaryUpgrade)
> > > > > + {
> > > > > + SpinLockAcquire(&s->mutex);
> > > > > + if (s->data.invalidated == RS_INVAL_NONE)
> > > > > + n_valid_logicalslots++;
> > > > > + SpinLockRelease(&s->mutex);
> > > > > +
> > > > > continue;
> > > > > + }
> > > > >
> > > > > We should use 'islogical' instead of SlotIsLogical here.
> > > > >
> > > > > 3)
> > > > > > InvalidateObsoleteReplicationSlots() is more robust now as we are
> > > > > > using both the 'invalidated' and 'released_lock' flags, but we still
> > > > > > guarantee nowhere that invalidated=true implies released_lock=true. Since
> > > > > we jump to 'restart' label only if released_lock is true, it becomes
> > > > > important to have an ASSERT which says invalidated=true implicitly
> > > > > means released_lock=true or vice versa. Because at the end we go by
> > > > > 'invalidated_logical' rather than 'released_lock' to decide about
> > > > > logical-decoding disabling.
> > > > >
> > > > > In this logic:
> > > > >
> > > > > + if (InvalidatePossiblyObsoleteSlot(possible_causes, s, oldestLSN,
> > > > > + dboid, snapshotConflictHorizon,
> > > > > + &released_lock))
> > > > > {
> > > > > - /* if the lock was released, start from scratch */
> > > > > - goto restart;
> > > > > + /* Remember we have invalidated a physical or logical slot */
> > > > > + invalidated = true;
> > > > > +
> > > > > + /*
> > > > > + * Additionally, remember we have invalidated a logical slot too
> > > > > + * as we can request disabling logical decoding later.
> > > > > + */
> > > > > + if (islogical)
> > > > > + invalidated_logical = true;
> > > > > }
> > > > >
> > > > > > Shall we have an Assert(released_lock) if
> > > > > > InvalidatePossiblyObsoleteSlot returns true? Or is there a better way?
> > > > >
> > > > > 4)
> > > > > + SpinLockAcquire(&s->mutex);
> > > > > + if (s->data.invalidated == RS_INVAL_NONE)
> > > > > + n_valid_logicalslots++;
> > > > >
> > > > > > In the same function, isn't the above code problematic? Don't we need
> > > > > > an 'islogical' check before incrementing 'n_valid_logicalslots'?
> > > > > > Otherwise it may wrongly count valid physical slots as well.
> > > >
> > > > > Agreed with all the above points. Will fix them and post an updated version.
> > > >
> > >
> > > I've attached the updated version patch. I addressed all comments I
> > > got so far, and made some cosmetic changes.
> > >
> >
> > Thanks. A few comments:
> >
> > 1)
> > Shall we update comments atop InvalidateObsoleteReplicationSlots() as
> > well, similar to other functions. Something like:
> >
> > If it invalidates the last logical slot in the cluster, it requests to
> > disable logical decoding.
>
> Okay, added.
>
> >
> > 2)
> > With the new sanity check (Assert(released_lock)) in
> > InvalidateObsoleteReplicationSlots, we have made sure that whenever a
> > slot is invalidated, the lock is released. But we have not made sure
> > that released_lock=true always implies a slot is invalidated. Looking
> > at InvalidatePossiblyObsoleteSlot(), that seems to always be the case,
> > but shall we have a sanity check for this as well? Thoughts?
> >
>
> I think it's possible that InvalidatePossiblyObsoleteSlot() releases
> the slot but doesn't invalidate it. For example, after it terminates
> the process owning the slot, the slot gets dropped or its restart_lsn
> (or xmin) gets advanced enough not to be invalidated.

Oh, if released_lock can be true while the slot isn't actually
invalidated, could we end up resetting 'n_valid_logicalslots' to 0
when we shouldn't? Or I guess even if that happens, we're still fine,
because the loop restarts and we'll hit that slot again; on the next
pass we'll either invalidate it or not release the lock. Is that
right? I just want to make sure we can't end up in a situation where
we miss counting a valid slot. And how can we enforce that with a
sanity check?
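
To make the question concrete, here is a minimal standalone sketch of the
restart-and-recount pattern as I understand it. It is not the patch code:
ToySlot, its fields, and both toy_* functions are hypothetical stand-ins, and
the order of the checks inside the loop is my assumption. The only point is
that the counter is reset at the 'restart' label, so a pass that releases the
lock without invalidating anything is simply redone from scratch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a replication slot. */
typedef struct ToySlot
{
	bool		in_use;
	bool		is_logical;
	bool		invalidated;
	bool		obsolete;		/* this pass would like to invalidate it */
	bool		advances;		/* slot catches up after its owner is signaled */
} ToySlot;

/*
 * Toy stand-in for InvalidatePossiblyObsoleteSlot(). Returns true only if
 * the slot was invalidated; *released_lock reports whether the (imaginary)
 * lock was dropped along the way.
 */
static bool
toy_invalidate_possibly_obsolete(ToySlot *s, bool *released_lock)
{
	*released_lock = false;

	if (!s->in_use || s->invalidated || !s->obsolete)
		return false;

	/* Both remaining paths drop the lock. */
	*released_lock = true;

	if (s->advances)
	{
		/* Lock released, but the slot advanced and is no longer obsolete. */
		s->obsolete = false;
		return false;
	}

	s->invalidated = true;
	return true;
}

/* Returns the number of valid logical slots seen in the final, full pass. */
static int
toy_invalidate_obsolete_slots(ToySlot *slots, size_t nslots,
							  bool *invalidated_logical)
{
	int			n_valid_logicalslots;

	*invalidated_logical = false;

restart:
	/* Resetting here discards any partial count from an interrupted pass. */
	n_valid_logicalslots = 0;

	for (size_t i = 0; i < nslots; i++)
	{
		ToySlot    *s = &slots[i];
		bool		released_lock = false;

		if (!s->in_use)
			continue;

		if (toy_invalidate_possibly_obsolete(s, &released_lock))
		{
			/* invalidated implies released_lock, but not the reverse */
			assert(released_lock);
			if (s->is_logical)
				*invalidated_logical = true;
		}

		if (released_lock)
			goto restart;

		if (s->is_logical && !s->invalidated)
			n_valid_logicalslots++;
	}

	return n_valid_logicalslots;
}
```

In this toy model every restart either invalidates a slot or clears its
'obsolete' flag, so the loop terminates, and the returned count always comes
from one uninterrupted pass. Whether the real function gives the same
guarantee is exactly what I'd like to confirm.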

thanks
Shveta
