Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Date: 2025-09-17 16:53:37
Message-ID: CAD21AoALaRUZkec7+XL_vFn0=wW8UbObS=FhymUK=zOeHxTMow@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 17, 2025 at 4:19 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Sep 16, 2025 at 11:49 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Tue, Sep 16, 2025 at 1:30 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > When user is dropping a temporary slot, we should disable the
> > > decoding. The lazy behaviour should be for ERROR or session_exit
> > > cases.
> >
> > I think it might be worth discussing whether to use lazy behavior in
> > all cases.
> >
>
> Agreed.
>
> > There are several advantages:
> >
> > - It mitigates the risk of connection timeouts during a logical slot
> > drop or a subscription drop.
> > - In scenarios involving frequent creation and deletion of logical
> > slots (such as during initial data synchronization), it could
> > potentially avoid the issue of a frequent switch on and off.
> >
> > On the other hand, drawbacks are:
> >
> > - users would have to wait for effective_wal_level to get decreased to
> > 'replica' somehow.
> > - makes the checkpointer more busy in addition to its checkpointing job.
> > - it could take a longer time to disable logical decoding if the
> > checkpointer is busy with a checkpointing job.
> >
>
> This last point in drawback could hurt performance of systems for a
> longer time when that was really not required. It should be okay to
> use lazy behavior in all cases when we can do that in a predictable
> time.

Agreed.

If we use the lazy behavior in ERROR or session_exit cases, we would
have these drawbacks anyway. But assuming such cases won't happen
frequently in practice, we can live with that.

> The other background process to consider doing lazy processing
> is the launcher whose role is to launch apply workers for subscription
> and maintain a conflict_slot (if required). Now, because disabling
> logical_info could also take longer time in worst cases, the
> launcher's own tasks can become unpredictable. Also, if tomorrow, we
> decide to support dynamically changing wal_level from minimal to some
> upper level, the launcher won't be the appropriate process.

Right. Also, we don't launch the launcher process when
max_logical_replication_workers == 0. It should be >0 on the
subscriber but might not be on the publisher.

>
> The other idea could be to have a new auxiliary process to disable
> logical_info lazily. It is arguable if we just have a separate process
> for this purpose but we have previously discussed some other tasks for
> such a process like removal of old_serialized_snapshots and
> old_logical_rewrite_map files. See [1]. If we agree to have a
> separate process for this purpose then disabling logical_info in all
> cases sounds okay to me.

Yeah, the custodian worker would be one solution. But please refer to
the subsequent discussions[1][2]; there might not be any tasks to
delegate to the custodian worker other than this logical decoding
deactivation, and it might not be optimal to have a single worker that
is responsible for all custodian work. Actually, we've discussed a
similar idea on this thread, and I drafted a patch[3] that uses
bgworkers to do internal tasks in the background in a
one-task-per-worker manner.

It requires more discussion anyway if we want to go in this
direction. I think we can start with using lazy behavior in ERROR or
session_exit cases (assuming they won't happen frequently in practice),
and consider using lazy behavior in other cases if it's really
preferable.

Regards,

[1] https://www.postgresql.org/message-id/1058306.1680467858%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/20230402184226.kkjplqvqu6utvzbt%40awork3.anarazel.de
[3] https://www.postgresql.org/message-id/CAD21AoCPc%2BpEgb0pJeiS2CU39ad8VW-10Ze7Uii%3D1RRjfgQ0uw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
