Re: Add an option to skip loading missing publication to avoid logical replication failure

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add an option to skip loading missing publication to avoid logical replication failure
Date: 2025-04-30 16:31:11
Message-ID: CALDaNm3TH3J8fwj+NcdjdN8DZCdrnmm1kzBsHBW2nkN+h6up3A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 30 Apr 2025 at 17:41, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Apr 30, 2025 at 11:22 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > Xuneng Zhou pointed out on Discord that the test case added by
> > 7c99dc587 has caused repeated failures in CI --- though oddly,
> > it's not failed in the buildfarm so far as I can find. The
> > failures look like
> >
> > timed out waiting for match: (?^:WARNING: ( [A-Z0-9]+:)? skipped loading publication: tap_pub_3) at /tmp/cirrus-ci-build/src/test/subscription/t/024_add_drop_pub.pl line 103.
> >
>
> I analyzed the relevant publisher-side CI Logs [1]:
> ...
> 2025-04-19 08:24:14.096 UTC [21961][client backend]
> [024_add_drop_pub.pl][7/4:0] LOG: statement: INSERT INTO tab_3
> values(1)
> 2025-04-19 08:24:14.098 UTC [21961][client backend]
> [024_add_drop_pub.pl][:0] LOG: disconnection: session time:
> 0:00:00.003 user=postgres database=postgres host=[local]
> 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][30/0:0] LOG:
> released logical replication slot "tap_sub"
> 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][:0] LOG:
> disconnection: session time: 0:00:00.329 user=postgres
> database=postgres host=[local]
> 2025-04-19 08:24:14.127 UTC [21979][not initialized] [[unknown]][:0]
> LOG: connection received: host=[local]
> 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0]
> LOG: connection authenticated: user="postgres" method=trust
> (/tmp/cirrus-ci-build/build/testrun/subscription/024_add_drop_pub/data/t_024_add_drop_pub_publisher_data/pgdata/pg_hba.conf:117)
> 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0]
> LOG: replication connection authorized: user=postgres
> application_name=tap_sub
> 2025-04-19 08:24:14.129 UTC [21979][walsender] [tap_sub][23/6:0] LOG:
> statement: SELECT pg_catalog.set_config('search_path', '', false);
> 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] LOG:
> received replication command: IDENTIFY_SYSTEM
> 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0]
> STATEMENT: IDENTIFY_SYSTEM
> 2025-04-19 08:24:14.131 UTC [21979][walsender] [tap_sub][23/0:0] LOG:
> received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
> 0/0 (proto_version '4', streaming 'parallel', origin 'any',
> publication_names '"tap_pub_
> ...
>
> This shows that walsender restarts after the "INSERT INTO tab_3
> values(1)" is processed by the previous walsender ("released logical
> replication slot "tap_sub"" is after "INSERT INTO tab_3 values(1)").
> So, it is possible that the old apply worker has sent the confirmation
> of WAL received location after the Insert (due to keep_alive message
> handling). So, after the restart, the new walsender will start
> processing WAL after the INSERT and wait for the skipped message LOG
> timed out.
>
> Considering the above theory is correct, after "ALTER SUBSCRIPTION
> tap_sub SET PUBLICATION tap_pub_3", we should wait for the new
> walsender to restart. We are already doing the same for a similar case
> in 001_rep_changes.pl (See "# check that change of connection string
> and/or publication list causes restart of subscription workers. We
> check the state along with application_name to ensure that the
> walsender is (re)started.).
>
> Unfortunately, I will be away for the rest of the week. In the
> meantime, if you or someone else is able to reproduce and fix it, then
> good; otherwise, I'll take care of it after I return.

I agree with your analysis. I was able to reproduce the issue by
delaying the invalidation of the subscription until the walsender
finished decoding the INSERT operation following the ALTER
SUBSCRIPTION through a debugger and using the lsn from the pg_waldump
of the INSERT after the ALTER SUBSCRIPTION. In this scenario, the
confirmed_flush_lsn ends up pointing to a location after the INSERT.
When the invalidation is eventually received and the apply
worker/walsender is restarted, the restarted walsender begins decoding
from that LSN—after the INSERT—which means the "skipped loading
publication" warning is never triggered, causing the test to fail.

Attached is a patch that ensures the walsender process is properly
restarted after ALTER SUBSCRIPTION, preventing this race condition.

Regards,
Vignesh

Attachment Content-Type Size
0001-Fix-race-condition-after-ALTER-SUBSCRIPTION-SET-PUBL.patch text/x-patch 2.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Sami Imseih 2025-04-30 16:38:08 Re: Introduce some randomness to autovacuum
Previous Message Japin Li 2025-04-30 15:48:53 Fix outdated comments for IndexInfo