From: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Date: | 2025-08-29 09:46:32 |
Message-ID: | CANhcyEXVyPS74B+Nmwfa3132agkZEDEv+Cg1xu9fp+5ppKx=Ww@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, 29 Aug 2025 at 09:38, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Wed, Aug 27, 2025 at 7:45 PM Hayato Kuroda (Fujitsu)
> <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> >
> > Dear Sawada-san,
> >
> > > > Assume that the logical_decoding value written in the WAL is false here, and
> > > > that a logical replication slot is created just after that. In my experiments,
> > > > the following happened:
> > > >
> > >
> > > Let me clarify each step:
> > >
> > > > 1. startup process updated logical_decoding_enabled to false, at line 8652.
> > >
> > > I assume that logical_decoding_enabled was enabled before step 1.
> >
> > Right. Initially, logical replication slots exist on both the primary and the standby.
> > In more detail, the standby slot was created by the slotsync worker.
> >
> > > > 2. slotsync worker started to sync. Surprisingly, it created a (second) logical
> > > > slot and started logical decoding with fast_forward mode.
> > >
> > > I guess that the postmaster launched the slotsync worker before the
> > > startup changes the status since logical decoding was enabled as I
> > > mentioned above, which seems fine to me.
> >
> > As you said, the slotsync worker has already been launched when the status is
> > changed. I felt logical slots should not be created after the status in shared
> > memory is changed.
> >
> > > > 3. startup invalidated logical slots due to the wal_level. The slot created at
> > > > step 2 was automatically dropped, because it was not sync-ready yet.
> > > > 4. startup process shut down the slotsync worker.
> > > > 5. startup process read the STATUS_CHANGE record again, which has the value
> > > > "true". It requested to restart the sync worker.
> > > > 6. restarted sync worker synchronized the slot again...
> > > >
> > > > For me it works well, but it is a bit strange because 1) logical decoding is
> > > > started even when effective_wal_level is false,
> > >
> > > I think it's a race condition between the postmaster and the startup,
> > > it could happen even between the backend and the startup; the startup
> > > disables logical decoding right after the backend passes
> > > CheckLogicalDecodingRequirements() check. I think it's technically
> > > okay since all WAL records before the STATUS_CHANGE should have the
> > > logical information. Even if it starts to do logical decoding, it
> > > would end up decoding the STATUS_CHANGE record and with an error (see
> > > xlog_decode()).
>
> My understanding of where the synced slot starts to move was not
> right; it starts from the remote slot's restart_lsn, which could be
> far ahead from the STATUS_CHANGE record that the startup process is
> applying but where logical decoding should be enabled. It doesn't
> happen that the slotsync worker tries to decode non-logical WAL
> records even if it advances the slot after the startup disabled
> logical decoding.
>
> > To clarify, are you thinking that it does not need to be fixed, because eventually
> > the system reaches the appropriate state, right?
>
> IIUC you're concerned it's possible that the slotsync worker creates
> or advances a logical slot between the time the startup process changes
> the logical decoding status to false and the time it sends the stop signal.
> TBH I have no idea how to fix it efficiently. I've considered a simple idea that the
> slotsync worker checks IsLogicalDecodingEnabled() before trying to
> sync one logical slot. However, it doesn't solve the race condition;
> the startup process can disable logical decoding right after the
> slotsync passed the check, in which case users would see the logical
> slot is created after logical decoding is disabled.
>
> Another race condition that we might need to deal with is, the
> slotsync worker is launched while logical decoding is still enabled,
> but if the startup sends the stop signal to the slotsync worker before
> the worker sets its pid to SlotSyncCtx->pid, the worker will keep
> running. I've added the check !IsLogicalDecodingEnabled() to the
> slotsync worker's initialization.
>
> >
> > > > and 2) the synced slot is
> > > > dropped once with below message:
> > > >
> > > > ```
> > > > LOG: terminating process 1474448 to release replication slot "test2"
> > > > DETAIL: Logical decoding on standby requires "wal_level" >= "logical" or at least one logical slot on the primary server.
> > > > CONTEXT: WAL redo at 0/030000B8 for XLOG/LOGICAL_DECODING_STATUS_CHANGE: false
> > > > ERROR: canceling statement due to conflict with recovery
> > > > DETAIL: User was using a logical replication slot that must be invalidated.
> > > > ```
> > > >
> > > > Can we stop the sync worker before updating the status? IIUC this is one of the
> > > > solutions.
> > >
> > > I think it would lead to another race condition; the slotsync worker
> > > can start again before updating the status.
> >
> > Hmm, okay.
> >
> > Another small comment: this data structure is not used in other files, no need to set extern.
> >
> > ```
> > extern LogicalDecodingCtlData *LogicalDecodingCtl;
> > ```
>
> Removed.
>
> I've attached the updated patch.
>
Hi Sawada-san,

Thanks for the updated patch.

I have a question. When we create a publication while wal_level is set to
'replica', we get this warning:

WARNING: logical decoding needs to be enabled to publish logical changes
HINT: Before creating subscriptions, set "wal_level" >= "logical" or
create a logical replication slot when "wal_level" = "replica".

The hint suggests that, when wal_level = 'replica', we should create a
logical slot on the publisher before creating a subscription. But when I
tested this scenario, I created a subscription without first creating a
logical slot on the publisher. The operation succeeded,
effective_wal_level was set appropriately, and logical replication
worked fine. I think this happens because the CREATE SUBSCRIPTION
command itself creates a logical slot on the publisher.

Should we update the HINT message here?
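
The scenario can be sketched roughly as follows. This is only an
illustration, assuming a publisher running with wal_level = 'replica' and
the patch under discussion applied; the table, publication, and
subscription names and the connection string are placeholders, and
effective_wal_level is assumed to be readable with SHOW, as the patch
proposes.

```sql
-- On the publisher (wal_level = 'replica', no logical slot exists yet).
-- Table and publication names are placeholders.
CREATE TABLE t1 (id int PRIMARY KEY);
CREATE PUBLICATION pub1 FOR TABLE t1;   -- emits the WARNING/HINT shown above

-- On the subscriber. The connection string is a placeholder.
CREATE TABLE t1 (id int PRIMARY KEY);
CREATE SUBSCRIPTION sub1
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub1;                   -- create_slot defaults to true

-- Back on the publisher: look for the slot that CREATE SUBSCRIPTION made,
-- and check effective_wal_level (assumed to be a GUC added by the patch).
SELECT slot_name, slot_type FROM pg_replication_slots;
SHOW effective_wal_level;
```

No separate step creates a logical slot on the publisher here, which is
why the HINT wording looks stricter than necessary.
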
Thanks,
Shlok Kyal