Re: Add an option to skip loading missing publication to avoid logical replication failure

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add an option to skip loading missing publication to avoid logical replication failure
Date: 2025-05-04 13:14:09
Message-ID: CALDaNm27gUnMG5-gdBLnWH_+4G+EZ_78MA2h8fbGPm9o5LjySA@mail.gmail.com
Lists: pgsql-hackers

On Fri, 2 May 2025 at 09:23, vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Fri, 2 May 2025 at 06:30, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > vignesh C <vignesh21(at)gmail(dot)com> writes:
> > > I agree with your analysis. I was able to reproduce the issue by
> > > delaying the invalidation of the subscription until the walsender
> > > finished decoding the INSERT operation following the ALTER
> > > SUBSCRIPTION through a debugger and using the lsn from the pg_waldump
> > > of the INSERT after the ALTER SUBSCRIPTION.
> >
> > Can you be a little more specific about how you reproduced this?
> > I tried inserting sleep() calls in various likely-looking spots
> > and could not get a failure that way.
>
> Test Steps:
> 1) Set up logical replication:
> Create a publication on the publisher
> Create a subscription on the subscriber
> 2) Create the following table on the publisher:
> CREATE TABLE tab_3 (a int);
> 3) Create the same table on the subscriber:
> CREATE TABLE tab_3 (a int);
> 4) On the subscriber, alter the subscription to refer to a
> non-existent publication:
> ALTER SUBSCRIPTION sub1 SET PUBLICATION tap_pub_3;
> 5) Insert data on the publisher:
> INSERT INTO tab_3 VALUES (1);
>
> As expected, the publisher logs the following warning in the normal case:
> 2025-05-02 08:56:45.350 IST [516197] WARNING: skipped loading publication: tap_pub_3
> 2025-05-02 08:56:45.350 IST [516197] DETAIL: The publication does not exist at this point in the WAL.
> 2025-05-02 08:56:45.350 IST [516197] HINT: Create the publication if it does not exist.
>
> To simulate a delay in subscription invalidation, I modified the
> maybe_reread_subscription() function as follows:
> diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
> index 4151a4b2a96..0831784aca3 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -3970,6 +3970,10 @@ maybe_reread_subscription(void)
>      MemoryContext oldctx;
>      Subscription *newsub;
>      bool        started_tx = false;
> +    bool        test = true;
> +
> +    if (test)
> +        return;
>
> This change delays the subscription invalidation logic, preventing the
> apply worker from detecting the subscription change immediately.
>
> With the patch applied, repeat steps 1–5.
> Using pg_waldump, identify the LSN of the insert:
> rmgr: Heap        len (rec/tot): 59/ 59, tx: 756, lsn: 0/01711848, prev 0/01711810, desc: INSERT+INIT off: 1
> rmgr: Transaction len (rec/tot): 46/ 46, tx: 756, lsn: 0/01711888, prev 0/01711848, desc: COMMIT 2025-05-02 09:06:09.400926 IST
>
> Check the walsender's confirmed flush LSN by attaching gdb to the
> walsender process:
> (gdb) p *MyReplicationSlot
> ...
> confirmed_flush = 24241928
> (gdb) p /x 24241928
> $4 = 0x171e708
>
> Now attach to the apply worker, set a breakpoint at
> maybe_reread_subscription, and continue execution. Once control
> reaches the function, set test = false. The apply worker will then
> detect that the subscription has been invalidated and restart.
>
> The walsender has already advanced confirmed_flush to a position
> after the insert, so the newly started apply worker misses the
> inserted row entirely. This leads to the CI failure. The issue can
> arise when the walsender advances more quickly than the apply worker
> is able to detect and react to the subscription change.
>
> I could not find a simpler way to reproduce this.

A simpler way to reproduce the issue is to add a 1-second sleep in the
LogicalRepApplyLoop function, just before the call to
WaitLatchOrSocket. With that change, the test fails consistently for
me. The failure reason is the same as in [1].

[1] - https://www.postgresql.org/message-id/CALDaNm2Q_pfwiCkaV920iXEbh4D%3D5MmD_tNQm_GRGX6-MsLxoQ%40mail.gmail.com

Regards,
Vignesh

Attachment Content-Type Size
ci_failure_reproduce.patch text/x-patch 2.2 KB
