From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | shveta malik <shveta(dot)malik(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Date: | 2025-10-13 22:47:56 |
Message-ID: | CAD21AoCUmrvcb-HgVmuZk=W1j9KKt=JmOFp8H+nXmtOcPcspLg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Oct 13, 2025 at 1:33 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Fri, Oct 10, 2025 at 1:54 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Fri, Oct 10, 2025 at 9:32 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Oct 9, 2025 at 4:14 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > Thank you for the comments.
> > > >
> > > > I've attached the updated patch. As we discussed, it now uses the lazy
> > > > deactivation in all cases and has the discussed points in the comments
> > > > atop logicalctl.c.
> > > >
> > >
> > > Thank You for making the changes. I am in the process of verifying the
> > > patch. Please find one concern.
>
> Thank you for reviewing the patch!
>
> > >
> > > When I promote a standby, there's a brief period during
> > > recovery-completion when status_change_allowed is set to true and also
> > > RecoveryInProgress() is still true. If I try to create a logical
> > > replication slot on the standby during this window, it fails with the
> > > following error:
> > >
> > > postgres=# SELECT pg_create_logical_replication_slot('slot1',
> > > 'pgoutput', false, false, false);
> > > ERROR: logical decoding on standby requires "effective_wal_level" >=
> > > "logical" on the primary
> > > HINT: To enable logical decoding, set "wal_level" >= "logical" or
> > > create at least one logical slot when "wal_level" = "replica".
> > >
> > > Note that the primary already has logical slots. This issue occurs
> > > because logical decoding is disabled by
> > > UpdateLogicalDecodingStatusEndOfRecovery() due to no existing slots on
> > > standby while RecoveryInProgress() is still true. Should
> > > CheckLogicalDecodingRequirements() also consider
> > > 'status_change_allowed' on standby to handle this transition more
> > > gracefully?
>
> Good point. I think we can check status_change_allowed instead of
> RecoveryInProgress(). Fixed the patch accordingly.
>
> >
> > 1)
> > I am creating a subscription for publication present on standby, I get
> > this error:
> >
> > postgres=# CREATE subscription sub1 connection 'dbname=postgres
> > host=localhost user=shveta port=5434' publication pub1;
> > ERROR: could not create replication slot "sub1": ERROR: logical
> > decoding on standby requires "effective_wal_level" >= "logical" on the
> > primary
> > HINT: To enable logical decoding, set "wal_level" >= "logical" or
> > create at least one logical slot when "wal_level" = "replica".
> >
> > Should Hint also mention 'on primary' i.e. 'To enable logical decoding
> > on primary, set..' , otherwise it could be slightly confusing to
> > follow HINT.
>
> Fixed.
>
> >
> > 2)
> > CheckLogicalDecodingRequirements():
> >
> > Is there a reason for moving 'logical decoding on standby requires..'
> > error before 'logical decoding requires a database connection' error?
> > Shall we keep the same order as earlier?
>
> Fixed.
>
> >
> > 3)
> > GetActiveWalLevelOnStandby() is not being used anywhere now.
>
> Right. However, it's an exposed function so I'm somewhat hesitant with
> removing it.
>
> >
> > 4)
> > + if (!old_pending_disable)
> > + {
> > + volatile PROC_HDR *procglobal = ProcGlobal;
> > + ProcNumber checkpointerProc = procglobal->checkpointerProc;
> > +
> > + /* Wake up the checkpointer */
> > + if (checkpointerProc != INVALID_PROC_NUMBER)
> > + SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch);
> > + }
> >
> > Why are we waking the checkpointer only when pending_disable changes
> > from false to true? When this function is called and
> > old_pending_disable is still true, wouldn’t that be an even stronger
> > reason to wake the checkpointer?
>
> I guess even if we set the checkpointer's latch multiple times, it
> would not help wake it up quickly because the checkpointer might be
> just busy with its checkpointing work. Having said that, it would help
> simplify the code structure a bit. Given that
> RequestDisableLogicalDecoding() function expects to be called only
> after dropping the possibly-last logical slot, it would not be harmful
> even if we set its latch anyway. And it might help the case where the
> checkpointer is temporarily not working and we could not set it latch.
> So I took your idea. Thanks!
>
> >
> > 5)
> > + /*
> > + * Update shmem flags. We don't need to care about the order of setting
> > + * global flag and writing the WAL record writes are not allowed yet.
> > + */
> >
> > Comment is somewhat problematic.
> >
> > 6)
> >
> > ReportSlotInvalidation():
> > + appendStringInfoString(&err_detail, _("Logical decoding on standby
> > requires \"wal_level\" >= \"logical\" or at least one logical slot on
> > the primary server."));
> >
> > This message could be a little ambiguous. It is not completely clear
> > that that wal_level=logical is needed on standby or on primary. How
> > about below msg:
> >
> > Logical decoding on standby requires the primary server to have either
> > wal_level >= logical or at least one logical replication slot created.
> >
> > 7)
> > + "effective_wal_level got decreased to 'replica' on the primary to
> > invalidate stnadby's slots"
> >
> > stnadby --> standby
>
> Fixed above points.
>
> >
> > 8)
> > + # Trigger promotion with no wait for the startup process to reach the
> > + # injection point.
> > + $standby5->safe_psql('postgres', qq[select pg_promote(false)]);
> > + note('promote the standby and waiting for injection_point');
> > + $standby5->wait_for_event('startup',
> > + 'startup-logical-decoding-status-change-end-of-recovery');
> > + note(
> > + "injection_point
> > 'startup-logical-decoding-status-change-end-of-recovery' is reached"
> > + );
> >
> > Is the comment correct here? We are waiting to reach the injection
> > point? But the comment says 'no wait for'?
>
> I meant to promote the server without wait (i.e., passing false to
> pg_promote() function). I modified the comments.
>
> >
> > 9)
> > + # Drop the logical slot, requesting to disable logical decoding to
> > the checkpointer.
> > + # It has to wait for the recovery to complete before disabling
> > logical decoding.
> >
> > Double space in 'It has'.
> >
> > 10)
> > + # Drop the logical slot, requesting to disable logical decoding to
> > the checkpointer.
> > + # It has to wait for the recovery to complete before disabling
> > logical decoding.
> > + $standby5->safe_psql('postgres',
> > + qq[select pg_drop_replication_slot('standby5_slot');]);
> > +
> > + # Resume the startup process to complete the recovery.
> > + $standby5->safe_psql('postgres',
> > + qq[select injection_points_wakeup('startup-logical-decoding-status-change-end-of-recovery')]
> > + );
> >
> > I think between these 2 steps, we should check logs for 'waiting for
> > recovery completion to change logical decoding status'.
> > The test will become more clear then.
>
> Fixed the above points.
>
> I've attached the updated version patch.
I noticed that cfbot tests failed and the patch needs to be updated to
the current HEAD. I've attached the updated patch.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment | Content-Type | Size |
---|---|---|
v18-0001-Enable-logical-decoding-dynamically-based-on-log.patch | application/octet-stream | 100.8 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2025-10-13 23:32:56 | Re: Fixed a typo in comment in compress_lz4.c |
Previous Message | Masahiko Sawada | 2025-10-13 22:35:06 | Re: Executing pg_createsubscriber with a non-compatible control file |