Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Date: 2025-10-13 20:33:01
Message-ID: CAD21AoAFvW0zTBgEvoFJ_4qukDRV0AwFNBjbpNa9-MOFesGyhA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Fri, Oct 10, 2025 at 1:54 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Fri, Oct 10, 2025 at 9:32 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Thu, Oct 9, 2025 at 4:14 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > Thank you for the comments.
> > >
> > > I've attached the updated patch. As we discussed, it now uses the lazy
> > > deactivation in all cases and has the discussed points in the comments
> > > atop logicalctl.c.
> > >
> >
> > Thank You for making the changes. I am in the process of verifying the
> > patch. Please find one concern.

Thank you for reviewing the patch!

> >
> > When I promote a standby, there's a brief period during
> > recovery-completion when status_change_allowed is set to true and also
> > RecoveryInProgress() is still true. If I try to create a logical
> > replication slot on the standby during this window, it fails with the
> > following error:
> >
> > postgres=# SELECT pg_create_logical_replication_slot('slot1',
> > 'pgoutput', false, false, false);
> > ERROR: logical decoding on standby requires "effective_wal_level" >=
> > "logical" on the primary
> > HINT: To enable logical decoding, set "wal_level" >= "logical" or
> > create at least one logical slot when "wal_level" = "replica".
> >
> > Note that the primary already has logical slots. This issue occurs
> > because logical decoding is disabled by
> > UpdateLogicalDecodingStatusEndOfRecovery() due to no existing slots on
> > standby while RecoveryInProgress() is still true. Should
> > CheckLogicalDecodingRequirements() also consider
> > 'status_change_allowed' on standby to handle this transition more
> > gracefully?

Good point. I think we can check status_change_allowed instead of
RecoveryInProgress(). Fixed the patch accordingly.

>
> 1)
> I am creating a subscription for publication present on standby, I get
> this error:
>
> postgres=# CREATE subscription sub1 connection 'dbname=postgres
> host=localhost user=shveta port=5434' publication pub1;
> ERROR: could not create replication slot "sub1": ERROR: logical
> decoding on standby requires "effective_wal_level" >= "logical" on the
> primary
> HINT: To enable logical decoding, set "wal_level" >= "logical" or
> create at least one logical slot when "wal_level" = "replica".
>
> Should Hint also mention 'on primary' i.e. 'To enable logical decoding
> on primary, set..' , otherwise it could be slightly confusing to
> follow HINT.

Fixed.

>
> 2)
> CheckLogicalDecodingRequirements():
>
> Is there a reason for moving 'logical decoding on standby requires..'
> error before 'logical decoding requires a database connection' error?
> Shall we keep the same order as earlier?

Fixed.

>
> 3)
> GetActiveWalLevelOnStandby() is not being used anywhere now.

Right. However, it's an exposed function so I'm somewhat hesitant with
removing it.

>
> 4)
> + if (!old_pending_disable)
> + {
> + volatile PROC_HDR *procglobal = ProcGlobal;
> + ProcNumber checkpointerProc = procglobal->checkpointerProc;
> +
> + /* Wake up the checkpointer */
> + if (checkpointerProc != INVALID_PROC_NUMBER)
> + SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch);
> + }
>
> Why are we waking the checkpointer only when pending_disable changes
> from false to true? When this function is called and
> old_pending_disable is still true, wouldn’t that be an even stronger
> reason to wake the checkpointer?

I guess even if we set the checkpointer's latch multiple times, it
would not help wake it up quickly because the checkpointer might be
just busy with its checkpointing work. Having said that, it would help
simplify the code structure a bit. Given that
RequestDisableLogicalDecoding() function expects to be called only
after dropping the possibly-last logical slot, it would not be harmful
even if we set its latch anyway. And it might help the case where the
checkpointer is temporarily not working and we could not set it latch.
So I took your idea. Thanks!

>
> 5)
> + /*
> + * Update shmem flags. We don't need to care about the order of setting
> + * global flag and writing the WAL record writes are not allowed yet.
> + */
>
> Comment is somewhat problematic.
>
> 6)
>
> ReportSlotInvalidation():
> + appendStringInfoString(&err_detail, _("Logical decoding on standby
> requires \"wal_level\" >= \"logical\" or at least one logical slot on
> the primary server."));
>
> This message could be a little ambiguous. It is not completely clear
> that that wal_level=logical is needed on standby or on primary. How
> about below msg:
>
> Logical decoding on standby requires the primary server to have either
> wal_level >= logical or at least one logical replication slot created.
>
> 7)
> + "effective_wal_level got decreased to 'replica' on the primary to
> invalidate stnadby's slots"
>
> stnadby --> standby

Fixed above points.

>
> 8)
> + # Trigger promotion with no wait for the startup process to reach the
> + # injection point.
> + $standby5->safe_psql('postgres', qq[select pg_promote(false)]);
> + note('promote the standby and waiting for injection_point');
> + $standby5->wait_for_event('startup',
> + 'startup-logical-decoding-status-change-end-of-recovery');
> + note(
> + "injection_point
> 'startup-logical-decoding-status-change-end-of-recovery' is reached"
> + );
>
> Is the comment correct here? We are waiting to reach the injection
> point? But the comment says 'no wait for'?

I meant to promote the server without wait (i.e., passing false to
pg_promote() function). I modified the comments.

>
> 9)
> + # Drop the logical slot, requesting to disable logical decoding to
> the checkpointer.
> + # It has to wait for the recovery to complete before disabling
> logical decoding.
>
> Double space in 'It has'.
>
> 10)
> + # Drop the logical slot, requesting to disable logical decoding to
> the checkpointer.
> + # It has to wait for the recovery to complete before disabling
> logical decoding.
> + $standby5->safe_psql('postgres',
> + qq[select pg_drop_replication_slot('standby5_slot');]);
> +
> + # Resume the startup process to complete the recovery.
> + $standby5->safe_psql('postgres',
> + qq[select injection_points_wakeup('startup-logical-decoding-status-change-end-of-recovery')]
> + );
>
> I think between these 2 steps, we should check logs for 'waiting for
> recovery completion to change logical decoding status'.
> The test will become more clear then.

Fixed the above points.

I've attached the updated version patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v17-0001-Enable-logical-decoding-dynamically-based-on-log.patch application/octet-stream 99.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2025-10-13 20:33:56 Re: Thoughts on a "global" client configuration?
Previous Message Christoph Berg 2025-10-13 20:30:52 Re: Thoughts on a "global" client configuration?