Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Date: 2025-09-09 18:44:54
Message-ID: CAD21AoATKbc=tLKBKQ46hKYWXW7+CvW9U3EYMjabVh=uNrr18Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 8, 2025 at 11:22 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Sep 8, 2025 at 11:22 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Fri, Sep 5, 2025 at 9:12 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Sat, Sep 6, 2025 at 3:58 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > On Tue, Sep 2, 2025 at 5:12 AM Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
> > > > >
> > > > >
> > > > > I tested the behaviour with HEAD and with Patch. And I confirmed the
> > > > > change in behaviour between HEAD and Patch
> > > > >
> > > > > Suppose we have a primary and a standby with wal_level = logical and
> > > > > guc parameters to enable slot sync worker are set accordingly. A slot
> > > > > sync worker will be running.
> > > > > Now we change the value of wal_level for primary to replica. And
> > > > > restart the primary server
> > > > >
> > > > > With HEAD, during restart the existing sync_slot_worker will exit with:
> > > > > 2025-09-02 11:49:08.846 IST [3877882] ERROR: synchronization worker
> > > > > "" could not connect to the primary server: connection to server at
> > > > > "localhost" (127.0.0.1), port 5432 failed: Connection refused
> > > > > Is the server running on that host and accepting TCP/IP connections?
> > > > > 2025-09-02 11:49:11.380 IST [3877885] FATAL: streaming replication
> > > > > receiver "walreceiver" could not connect to the primary server:
> > > > > connection to server at "localhost" (127.0.0.1), port 5432 failed:
> > > > > Connection refused
> > > > > Is the server running on that host and accepting TCP/IP connections?
> > > > >
> > > > > and after the restart of the primary server, slot sync worker will
> > > > > restart and it is able to connect to the primary.
> > > > >
> > > > > With Patch, during restart the existing sync_slot_worker will exit.
> > > > > But after the restart of the primary server, slot sync worker cannot
> > > > > start and we can see following log:
> > > > > 2025-09-02 12:44:51.497 IST [3947520] LOG: replication slot
> > > > > synchronization worker is shutting down on receiving SIGINT
> > > > > 2025-09-02 12:44:51.498 IST [3943504] LOG: replication slot
> > > > > synchronization requires logical decoding to be enabled
> > > > > 2025-09-02 12:44:51.498 IST [3943504] HINT: To enable logical
> > > > > decoding on primary, set "wal_level" >= "logical" or create at least
> > > > > one logical slot when "wal_level" = "replica".
> > > > > 2025-09-02 12:45:51.537 IST [3943504] LOG: replication slot
> > > > > synchronization requires logical decoding to be enabled
> > > > > 2025-09-02 12:45:51.537 IST [3943504] HINT: To enable logical
> > > > > decoding on primary, set "wal_level" >= "logical" or create at least
> > > > > one logical slot when "wal_level" = "replica".
> > > > >
> > > > > So, with HEAD, after we restart the primary server with 'wal_level =
> > > > > replica', the slot sync worker can restart and connect to the primary
> > > > > but with patch it cannot start after restart due to the check in
> > > > > ValidateSlotSyncParams.
> > > >
> > > > But the slotsync worker is launched again once logical decoding is
> > > > enabled, no? I'm not sure that we want to launch the slotsync worker
> > > > also when we know logical decoding is not enabled.
> > > >
> > >
> > > Why in the first place the logical_decoding enabled check has failed
> > > because IIUC, the wal_level on standby is still 'logical'?
> >
> > This is because logical decoding on standbys can be used only when the
> > standby's effective_wal_level is 'logical', which also means the
> > primary's effective_wal_level is 'logical' too. This behavior is
> > mostly the same as today; logical decoding on standbys can be used
> > only when both the primary and the standbys set wal_level to
> > 'logical'. Even if standby's wal_level is set to logical, it doesn't
> > mean that incoming WAL records are generated on the primary with the
> > information required by logical decoding.
> >
>
> This is true but IIUC Shlok's report says that we are able to restart
> server before patch and not after patch. Am, I missing something? If
> not, then shouldn't this be fixed separately first?

I've reread his report. IIUC, what happened in his test scenario was:
while he was restarting the primary server (to make
wal_level='replica' take effect), the slotsync worker exited due to a
connection error. Then after the primary started up, with the patch,
the slotsync worker was not launched again, whereas it was launched
again without the patch. This is because, with the patch, the standby
disables logical decoding when replaying the STATUS_CHANGE record.
If the primary enables logical decoding again, the STATUS_CHANGE
record with logical_decoding=true is replicated to the standby and it
launches the slotsync worker again. That is, the slotsync worker
launches based on the standby's effective_wal_level. On the other
hand, before the patch, the slotsync worker is launched solely based
on the standby's wal_level. Therefore, it launches but doesn't do
anything in this case (as the primary should not have any logical
slots). I thought it made sense not to launch the slotsync
worker when effective_wal_level is 'replica', but is your suggestion
that the slotsync worker should be launched whenever the standby's
wal_level is 'logical', regardless of effective_wal_level?
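
To make the difference concrete, here is a toy C model of the two
launch conditions described above. It is only a sketch and not code
from the patch: StandbyState, replay_status_change() and the two
should_launch_*() helpers are hypothetical names, and the model assumes
the standby's effective level is taken purely from the last replayed
STATUS_CHANGE record.

```c
#include <stdbool.h>

/* Illustrative stand-ins only; not PostgreSQL's actual definitions. */
typedef enum { LEVEL_REPLICA, LEVEL_LOGICAL } WalLevel;

typedef struct
{
    WalLevel configured_wal_level;   /* the standby's own wal_level setting */
    bool     logical_decoding;       /* value from the last replayed STATUS_CHANGE */
} StandbyState;

/* Hypothetical: the standby replays a STATUS_CHANGE record from the primary. */
static void
replay_status_change(StandbyState *s, bool logical_decoding)
{
    s->logical_decoding = logical_decoding;
}

/* Assumption: effective level on the standby follows the replayed status. */
static WalLevel
standby_effective_wal_level(const StandbyState *s)
{
    return s->logical_decoding ? LEVEL_LOGICAL : LEVEL_REPLICA;
}

/* Before the patch: launch decided solely by the standby's wal_level. */
static bool
should_launch_before_patch(const StandbyState *s)
{
    return s->configured_wal_level == LEVEL_LOGICAL;
}

/* With the patch: launch decided by the standby's effective_wal_level. */
static bool
should_launch_with_patch(const StandbyState *s)
{
    return standby_effective_wal_level(s) == LEVEL_LOGICAL;
}
```

In Shlok's scenario the standby keeps wal_level = logical and replays
a record with logical_decoding=false, so the first check still passes
while the second does not until a record with logical_decoding=true
arrives.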

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
