Re: Add an option to skip loading missing publication to avoid logical replication failure

From: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Add an option to skip loading missing publication to avoid logical replication failure
Date: 2025-05-02 10:44:31
Message-ID: CABPTF7XH8Uh+K-x3RMt6fOkK3xwSD2YVQehCfp_hb1TS0abe+w@mail.gmail.com
Lists: pgsql-hackers

Yes, thanks for the clarification. I have a basic understanding of it now.
What I mean is: is this considered a bug or a design defect in the codebase?
If so, should we prevent it from occurring in general, not just in this
specific test?

vignesh C <vignesh21(at)gmail(dot)com> wrote:

>
> We have three processes involved in this scenario:
> 1. A walsender process on the publisher, responsible for decoding and
>    sending WAL changes.
> 2. An apply worker process on the subscriber, which applies the changes.
> 3. A session executing the ALTER SUBSCRIPTION command.
>
> Due to the asynchronous nature of these processes, the ALTER
> SUBSCRIPTION command may not be immediately observed by the apply
> worker. Meanwhile, the walsender may process and decode an INSERT
> statement.
> If the insert targets a table (e.g., tab_3) that does not belong to
> the current publication (pub1), the walsender silently skips
> replicating the record and advances its decoding position. This
> position is sent in a keepalive message to the subscriber, and since
> there are no pending transactions to flush, the apply worker reports
> it as the latest received LSN.
> Later, when the apply worker eventually detects the subscription
> change, it restarts—but by then, the insert has already been skipped
> and is no longer eligible for replay, as the table was not part of the
> publication (pub1) at the time of decoding.
> This race condition arises because the three processes run
> independently and may progress at different speeds due to CPU
> scheduling or system load.
> Thoughts?
>
> Regards,
> Vignesh
>
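
The sequence described above can be sketched as a small deterministic
simulation. This is only a toy model under stated assumptions, not PostgreSQL
code: the names WalRecord, Subscriber, walsender_decode and simulate are
hypothetical, "pub2" is an assumed second publication that contains tab_3, and
the LSN is a plain integer. It shows only the ordering problem: once the
walsender has skipped the insert and the subscriber has confirmed that
position, a later restart of the apply worker cannot bring the insert back.

```python
from dataclasses import dataclass, field


@dataclass
class WalRecord:
    lsn: int      # simplified LSN, a plain integer in this model
    table: str    # table targeted by the INSERT


@dataclass
class Subscriber:
    publications: set                  # list the running apply worker requested
    confirmed_lsn: int = 0             # latest position reported to the publisher
    applied: list = field(default_factory=list)


def walsender_decode(record, requested_pubs, pub_tables):
    """Return (send_change, position). Unpublished changes are skipped,
    but the decoding position still advances past them."""
    published = any(record.table in pub_tables[p] for p in requested_pubs)
    return published, record.lsn


def simulate(worker_sees_alter_first):
    # Assumption: pub2 is a hypothetical publication that includes tab_3.
    pub_tables = {"pub1": {"tab_1"}, "pub2": {"tab_3"}}
    sub = Subscriber(publications={"pub1"})
    insert = WalRecord(lsn=100, table="tab_3")

    # The ALTER SUBSCRIPTION session has already updated the catalog.
    catalog_pubs = {"pub1", "pub2"}

    if worker_sees_alter_first:
        # Apply worker restarts with the new list before the insert is decoded.
        sub.publications = set(catalog_pubs)

    send, position = walsender_decode(insert, sub.publications, pub_tables)
    if send:
        sub.applied.append(insert.table)

    # Keepalive carries the position; nothing is pending, so it is confirmed.
    sub.confirmed_lsn = position

    # The worker eventually notices the change and restarts, but streaming
    # resumes after confirmed_lsn, so a skipped record is never decoded again.
    sub.publications = set(catalog_pubs)
    if insert.lsn > sub.confirmed_lsn:
        send, _ = walsender_decode(insert, sub.publications, pub_tables)
        if send:
            sub.applied.append(insert.table)

    return sub.applied


if __name__ == "__main__":
    print("worker slow:", simulate(worker_sees_alter_first=False))
    print("worker fast:", simulate(worker_sees_alter_first=True))
```

In this model the only difference between the two runs is which event the
scheduler happens to order first, which is the point of the explanation above.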
