From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
Cc: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Date: | 2025-08-29 04:07:41 |
Message-ID: | CAD21AoAz1RkCfs-VD6Sm9bCFKiDC=9O-KAtcjxXeL76O3z8PaQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Aug 27, 2025 at 7:45 PM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Sawada-san,
>
> > > Assuming that logical_decoding written in the WAL is false here, and a logical
> > > replication slot is created just after that. In my experiments below happened:
> > >
> >
> > Let me clarify each step:
> >
> > > 1. startup process updated logical_decoding_enabled to false, at line 8652.
> >
> > I assume that logical_decoding_enabled was enabled before step 1.
>
> Right. Initially logical replication slot exist on both primary and standby.
> More detail; the standby slot was created by the slotsync worker.
>
> > > 2. slotsync worker started to sync. Surprisingly, it created a (second) logical
> > > slot and started logical decoding with fast_forward mode.
> >
> > I guess that the postmaster launched the slotsync worker before the
> > startup changes the status since logical decoding was enabled as I
> > mentioned above, which seems fine to me.
>
> As you said, the slotsync worker has already been launched when the status is
> changed. I felt logical slot() should not be created after the status on the shared
> memory is changed.
>
> > > 3. startup invalidated logical slots due to the wal_level. the slot created at
> > > step2 was automatically dropped, because it was not sync-ready yet.
> > > 4. startup process shut down the slotsync worker.
> > > 5. startup process read the STATUS_CHANGE record again, which has the value "true".
> > > it requested to restart the sync worker.
> > > 6. restarted sync worker synchronize the slot again...
> > >
> > > For me it works well but it is a bit strange because 1) logical decoding is
> > > started even when effective_wal_level is false,
> >
> > I think it's a race condition between the postmaster and the startup,
> > it could happen even between the backend and the startup; the startup
> > disables logical decoding right after the backend passes
> > CheckLogicalDecodingRequirements() check. I think it's technically
> > okay since all WAL records before the STATUS_CHANGE should have the
> > logical information. Even if it starts to do logical decoding, it
> > would end up decoding the STATUS_CHANGE record and with an error (see
> > xlog_decode()).
My understanding of where the synced slot starts to move was not
right; it starts from the remote slot's restart_lsn, which could be
far ahead of the STATUS_CHANGE record that the startup process is
applying and is a position where logical decoding should be enabled.
So the slotsync worker never tries to decode non-logical WAL records,
even if it advances the slot after the startup process has disabled
logical decoding.
> To clarify, are you thinking that it is no need to be fixed, because eventually
> the system becomes the appropriate state, right?
IIUC you're concerned that the slotsync worker could create or advance
a logical slot between the time the startup process changes the logical
decoding status to false and the time it sends the stop signal. TBH I
have no idea how to fix it efficiently. I've considered a simple idea:
the slotsync worker checks IsLogicalDecodingEnabled() before trying to
sync each logical slot. However, that doesn't solve the race condition;
the startup process can disable logical decoding right after the
slotsync worker passes the check, in which case users would see a
logical slot created after logical decoding is disabled.
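To make the window concrete, here is a minimal single-threaded sketch of the check-then-act problem. It is not code from the patch: SharedState, Interleaving, and sync_one_slot() are hypothetical names, and the two "threads" are modeled as a fixed interleaving so the outcome is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared-memory status; not the patch's struct. */
typedef struct
{
    bool logical_decoding_enabled;
    int  slots_created_while_disabled;
} SharedState;

/* Where the (simulated) startup process disables logical decoding. */
typedef enum
{
    STARTUP_DISABLES_BEFORE_CHECK,
    STARTUP_DISABLES_AFTER_CHECK
} Interleaving;

/*
 * Models one sync attempt by the slotsync worker. Returns true if the
 * worker went on to create/advance the slot.
 */
bool
sync_one_slot(SharedState *st, Interleaving when)
{
    assert(st != NULL);

    if (when == STARTUP_DISABLES_BEFORE_CHECK)
        st->logical_decoding_enabled = false;   /* startup ran first */

    /* The proposed pre-sync check (IsLogicalDecodingEnabled() in the mail). */
    if (!st->logical_decoding_enabled)
        return false;

    if (when == STARTUP_DISABLES_AFTER_CHECK)
        st->logical_decoding_enabled = false;   /* startup ran inside the window */

    /* The "act" step: the slot gets created regardless of the new status. */
    if (!st->logical_decoding_enabled)
        st->slots_created_while_disabled++;
    return true;
}
```

With STARTUP_DISABLES_BEFORE_CHECK the worker skips the slot, but with STARTUP_DISABLES_AFTER_CHECK the check passes and the slot is still created after the status has flipped, which is the case described above.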
Another race condition that we might need to deal with: the slotsync
worker is launched while logical decoding is still enabled, but if the
startup process sends the stop signal before the worker sets its pid in
SlotSyncCtx->pid, the worker will keep running. I've added a
!IsLogicalDecodingEnabled() check to the slotsync worker's
initialization.
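A rough model of that missed-signal case, and of why a status re-check during initialization closes it, might look like the following. All names here (SyncShared, startup_disable_and_signal(), worker_init()) are hypothetical, and the placement of the check relative to the pid registration is an assumption for illustration, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the shared state; 0 means no worker has registered. */
typedef struct
{
    int  worker_pid;
    bool logical_decoding_enabled;
} SyncShared;

/*
 * Startup side: disable logical decoding, then signal the registered worker.
 * Returns true if there was a pid to signal, false if the signal was lost.
 */
bool
startup_disable_and_signal(SyncShared *s)
{
    assert(s != NULL);
    s->logical_decoding_enabled = false;
    return s->worker_pid != 0;
}

/*
 * Worker side: register the pid, optionally re-checking the status.
 * Returns true if the worker keeps running.
 */
bool
worker_init(SyncShared *s, int my_pid, bool recheck_status)
{
    assert(s != NULL);
    s->worker_pid = my_pid;

    /* Re-check after registering, so a stop request sent earlier is noticed. */
    if (recheck_status && !s->logical_decoding_enabled)
    {
        s->worker_pid = 0;
        return false;
    }
    return true;
}
```

If the startup side runs before worker_init(), no pid is registered and the signal is lost; without the re-check the worker keeps running, while with it the worker exits on its own.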
>
> > > and 2) the synced slot is
> > > dropped once with below message:
> > >
> > > ```
> > > LOG: terminating process 1474448 to release replication slot "test2"
> > > DETAIL: Logical decoding on standby requires "wal_level" >= "logical" or at least one logical slot on the primary server.
> > > CONTEXT: WAL redo at 0/030000B8 for XLOG/LOGICAL_DECODING_STATUS_CHANGE: false
> > > ERROR: canceling statement due to conflict with recovery
> > > DETAIL: User was using a logical replication slot that must be invalidated.
> > > ```
> > >
> > > Can we stop the sync worker before updating the status? IIUC this is one of the
> > > solution.
> >
> > I think it would lead to another race condition; the slotsync worker
> > can start again before updating the status.
>
> Hmm, okay.
>
> Another small comment: this data structure is not used in other files, no need to set extern.
>
> ```
> extern LogicalDecodingCtlData *LogicalDecodingCtl;
> ```
Removed.
I've attached the updated patch.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment | Content-Type | Size |
---|---|---|
v12-0001-Enable-logical-decoding-dynamically-based-on-log.patch | application/octet-stream | 94.9 KB |