Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Date: 2025-09-18 03:56:44
Message-ID: CAA4eK1JVNbb-OT1PO=iOFG1KA__Q83n8cLZoDjF2yA1rZyvCnA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 17, 2025 at 10:24 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Wed, Sep 17, 2025 at 4:19 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Tue, Sep 16, 2025 at 11:49 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Sep 16, 2025 at 1:30 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > When a user is dropping a temporary slot, we should disable the
> > > > decoding. The lazy behaviour should be for ERROR or session_exit
> > > > cases.
> > >
> > > I think it might be worth discussing whether to use lazy behavior in
> > > all cases.
> > >
> >
> > Agreed.
> >
> > > There are several advantages:
> > >
> > > - It mitigates the risk of connection timeouts during a logical slot
> > > drop or a subscription drop.
> > > - In scenarios involving frequent creation and deletion of logical
> > > slots (such as during initial data synchronization), it could
> > > potentially avoid the issue of a frequent switch on and off.
> > >
> > > On the other hand, drawbacks are:
> > >
> > > - users would have to wait for effective_wal_level to get decreased to
> > > 'replica' somehow.
> > > - it makes the checkpointer busier, on top of its checkpointing job.
> > > - it could take a longer time to disable logical decoding if the
> > > checkpointer is busy with a checkpointing job.
> > >
> >
> > This last drawback could hurt the performance of systems for a
> > longer time when that was really not required. It should be okay to
> > use lazy behavior in all cases when we can do that in a predictable
> > time.
>
> Agreed.
>
> If we use the lazy behavior in ERROR or session_exit cases, we would
> have these drawbacks anyway. But assuming it won't happen frequently
> in practice, we can live with that.
>
> > The other background process to consider doing lazy processing
> > is the launcher whose role is to launch apply workers for subscription
> > and maintain a conflict_slot (if required). Now, because disabling
> > logical_info could also take longer time in worst cases, the
> > launcher's own tasks can become unpredictable. Also, if tomorrow, we
> > decide to support dynamically changing wal_level from minimal to some
> > upper level, the launcher won't be the appropriate process.
>
> Right. Also, we don't launch the launcher process when
> max_logical_replication_workers == 0. It should be >0 on the
> subscriber but might not be on the publisher.
>
> >
> > The other idea could be to have a new auxiliary process to disable
> > logical_info lazily. It is arguable if we just have a separate process
> > for this purpose but we have previously discussed some other tasks for
> > such a process like removal of old_serialized_snapshots and
> > old_logical_rewrite_map files. See [1]. If we agree to have a
> > separate process for this purpose then disabling logical_info in all
> > cases sounds okay to me.
>
> Yeah, the custodian worker would be one solution. But please refer to
> subsequent discussions[1][2];
>

I think Tom's idea of spawning the worker on an as-needed basis has some
use here: for example, during drop_slot, we can launch the worker to
complete this task and then exit, which ameliorates the risk of
connection_timeout for drop subscription cases. However, we can consider
such ideas as iterative improvements as well.

> there might not be other tasks to
> delegate to the custodian worker than this logical decoding
> deactivation, and it might not be optimal to have a single worker that
> is responsible for all custodian work. Actually we've discussed a
> similar idea on this thread and I drafted a patch[3] that utilizes
> bgworkers to do internal tasks in the background in a
> one-task-per-one-worker manner.
>
> It requires more discussion anyway if we want to go with this
> direction. I think we can start with using lazy behavior in ERROR or
> session_exit cases (assuming it won't happen frequently in practice),
> and consider using lazy behavior in other cases if it's really
> preferable.
>

Fair enough. So, let's proceed with this plan (use lazy behavior in
ERROR and session_exit cases) and see how it works. BTW, we also need
to consider ERROR cases where the slot is dropped but we fail to
disable logical_info due to some random ERROR.

--
With Regards,
Amit Kapila.
