Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Date: 2025-08-27 17:54:11
Message-ID: CAD21AoDtfZ0P_zMNauVM4FBXrQx-yU7ms-Rcem2b2RusKeWn8A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 27, 2025 at 5:08 AM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Sawada-san,
>
> Thanks for updating the patch. Here are my comments.

Thank you for reviewing the patch!

>
> xlog_desc()
> ```
> else if (info == XLOG_LOGICAL_DECODING_STATUS_CHANGE)
> {
>     bool enabled;
>
>     memcpy(&enabled, rec, sizeof(bool));
>     appendStringInfo(buf, enabled ? "true" : "false");
> }
> ```
>
> Per 2075ba9, appendStringInfoString() can be used if we do not have other messages.

Agreed, will fix.
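
A minimal standalone sketch of the idea, so that it compiles outside
the tree. decoding_status_str() is a hypothetical helper introduced only
for this example and is not part of the patch; in xlog_desc() the
resulting constant string would be passed to appendStringInfoString()
because no format string is involved.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): copy the bool payload of an
 * XLOG_LOGICAL_DECODING_STATUS_CHANGE record out of the raw record data and
 * return the constant string to display.
 *
 * In the tree, xlog_desc() would then do something like:
 *     appendStringInfoString(buf, enabled ? "true" : "false");
 */
static const char *
decoding_status_str(const char *rec)
{
    bool enabled;

    /* memcpy avoids assuming anything about the alignment of the record data */
    memcpy(&enabled, rec, sizeof(bool));
    return enabled ? "true" : "false";
}
```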

>
> logicalctl.h
> ```
> extern void UpdateNumberOfLogicalSlots(bool incr);
> ```
>
> This function is not implemented.

Removed.

>
> UpdateLogicalDecodingStatus()
> ```
> elog(DEBUG1, "update logical decoding status to %d", new_status);
> ```
>
> I prefer to use true/false instead of 1/0. Thoughts?

I think we don't necessarily need it as it's a debug log.

> xlog_redo()
> ```
> /* Update the status on shared memory */
> memcpy(&logical_decoding, XLogRecGetData(record), sizeof(bool));
> UpdateLogicalDecodingStatus(logical_decoding, true);
>
> if (InRecovery && InHotStandby)
> {
>     if (!logical_decoding)
>     {
>         /*
>          * Invalidate logical slots if we are in hot standby and the
>          * primary disabled the logical decoding.
>          */
>         InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_LEVEL,
>                                            0, InvalidOid,
>                                            InvalidTransactionId);
>
> ```
>
> Assume that the logical_decoding value written in the WAL is false here, and
> that a logical replication slot is created just after that. In my experiments,
> the following happened:
>

Let me clarify each step:

> 1. startup process updated logical_decoding_enabled to false, at line 8652.

I assume that logical_decoding_enabled was true before step 1.

> 2. slotsync worker started to sync. Surprisingly, it created a (second) logical
> slot and started logical decoding in fast_forward mode.

I guess that the postmaster launched the slotsync worker before the
startup process changed the status, since logical decoding was enabled
as I mentioned above. That seems fine to me.

> 3. startup invalidated logical slots due to the wal_level. the slot created at
> step 2 was automatically dropped, because it was not sync-ready yet.
> 4. startup process shut down the slotsync worker.
> 5. startup process read the STATUS_CHANGE record again, which has the value "true".
> it requested to restart the sync worker.
> 6. restarted sync worker synchronized the slot again...
>
> For me it works well, but it is a bit strange because 1) logical decoding is
> started even when effective_wal_level is false,

I think it's a race condition between the postmaster and the startup
process, and it could happen even between a backend and the startup
process: the startup process disables logical decoding right after the
backend passes the CheckLogicalDecodingRequirements() check. I think
it's technically okay since all WAL records before the STATUS_CHANGE
record should have the logical information. Even if it starts to do
logical decoding, it would end up decoding the STATUS_CHANGE record and
fail with an error (see xlog_decode()).
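
The argument above can be illustrated with a toy model. This is not
PostgreSQL code: ToyRecord, ToyRecKind and toy_decode() are hypothetical
names, and the sketch assumes that decoding raises its error exactly
when it reaches a STATUS_CHANGE record that disables logical decoding.
It only shows that a decoder which started "too late" still processes
the records before that point and then stops.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy record kinds; illustrative only, not PostgreSQL's record types. */
typedef enum { REC_CHANGE, REC_STATUS_CHANGE } ToyRecKind;

typedef struct
{
    ToyRecKind kind;
    bool       enabled;   /* payload, meaningful only for REC_STATUS_CHANGE */
} ToyRecord;

/*
 * Decode records in order. Returns the number of records decoded before
 * stopping, and sets *failed if decoding hit a STATUS_CHANGE record that
 * disables logical decoding (where the real code is assumed to error out).
 */
static size_t
toy_decode(const ToyRecord *recs, size_t nrecs, bool *failed)
{
    size_t i;

    *failed = false;
    for (i = 0; i < nrecs; i++)
    {
        if (recs[i].kind == REC_STATUS_CHANGE && !recs[i].enabled)
        {
            *failed = true;     /* stop here; later records are never decoded */
            break;
        }
    }
    return i;
}
```

In this model, every record counted in the return value precedes the
disabling STATUS_CHANGE record, which corresponds to the records that
should still carry the logical information.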

> and 2) the synced slot is
> dropped once with the message below:
>
> ```
> LOG: terminating process 1474448 to release replication slot "test2"
> DETAIL: Logical decoding on standby requires "wal_level" >= "logical" or at least one logical slot on the primary server.
> CONTEXT: WAL redo at 0/030000B8 for XLOG/LOGICAL_DECODING_STATUS_CHANGE: false
> ERROR: canceling statement due to conflict with recovery
> DETAIL: User was using a logical replication slot that must be invalidated.
> ```
>
> Can we stop the sync worker before updating the status? IIUC this is one
> possible solution.

I think it would lead to another race condition; the slotsync worker
could be started again before the status is updated.
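
A toy model of that window, as a sketch only. ToyState,
toy_postmaster_tick() and toy_stop_then_disable() are hypothetical
names, and the model assumes the postmaster relaunches the worker
whenever the status flag is still set and no worker is running.

```c
#include <stdbool.h>

typedef struct
{
    bool decoding_enabled;  /* shared status flag */
    bool worker_running;    /* is a slotsync worker alive? */
} ToyState;

/* One postmaster loop iteration: (re)launch the worker while the flag is set. */
static void
toy_postmaster_tick(ToyState *s)
{
    if (s->decoding_enabled && !s->worker_running)
        s->worker_running = true;
}

/*
 * Proposed ordering: stop the worker first, then update the status.
 * 'postmaster_runs_in_between' models the postmaster being scheduled
 * in the window between the two steps.
 */
static void
toy_stop_then_disable(ToyState *s, bool postmaster_runs_in_between)
{
    s->worker_running = false;          /* step A: stop the worker */
    if (postmaster_runs_in_between)
        toy_postmaster_tick(s);         /* flag is still true: relaunch */
    s->decoding_enabled = false;        /* step B: update the status */
}
```

With postmaster_runs_in_between set, the model ends with a running
worker and a disabled status, which is the same situation as before the
reordering.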

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
