From: | vignesh C <vignesh21(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Add an option to skip loading missing publication to avoid logical replication failure |
Date: | 2025-04-30 16:31:11 |
Message-ID: | CALDaNm3TH3J8fwj+NcdjdN8DZCdrnmm1kzBsHBW2nkN+h6up3A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, 30 Apr 2025 at 17:41, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Apr 30, 2025 at 11:22 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > Xuneng Zhou pointed out on Discord that the test case added by
> > 7c99dc587 has caused repeated failures in CI --- though oddly,
> > it's not failed in the buildfarm so far as I can find. The
> > failures look like
> >
> > timed out waiting for match: (?^:WARNING: ( [A-Z0-9]+:)? skipped loading publication: tap_pub_3) at /tmp/cirrus-ci-build/src/test/subscription/t/024_add_drop_pub.pl line 103.
> >
>
> I analyzed the relevant publisher-side CI Logs [1]:
> ...
> 2025-04-19 08:24:14.096 UTC [21961][client backend]
> [024_add_drop_pub.pl][7/4:0] LOG: statement: INSERT INTO tab_3
> values(1)
> 2025-04-19 08:24:14.098 UTC [21961][client backend]
> [024_add_drop_pub.pl][:0] LOG: disconnection: session time:
> 0:00:00.003 user=postgres database=postgres host=[local]
> 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][30/0:0] LOG:
> released logical replication slot "tap_sub"
> 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][:0] LOG:
> disconnection: session time: 0:00:00.329 user=postgres
> database=postgres host=[local]
> 2025-04-19 08:24:14.127 UTC [21979][not initialized] [[unknown]][:0]
> LOG: connection received: host=[local]
> 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0]
> LOG: connection authenticated: user="postgres" method=trust
> (/tmp/cirrus-ci-build/build/testrun/subscription/024_add_drop_pub/data/t_024_add_drop_pub_publisher_data/pgdata/pg_hba.conf:117)
> 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0]
> LOG: replication connection authorized: user=postgres
> application_name=tap_sub
> 2025-04-19 08:24:14.129 UTC [21979][walsender] [tap_sub][23/6:0] LOG:
> statement: SELECT pg_catalog.set_config('search_path', '', false);
> 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] LOG:
> received replication command: IDENTIFY_SYSTEM
> 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0]
> STATEMENT: IDENTIFY_SYSTEM
> 2025-04-19 08:24:14.131 UTC [21979][walsender] [tap_sub][23/0:0] LOG:
> received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
> 0/0 (proto_version '4', streaming 'parallel', origin 'any',
> publication_names '"tap_pub_
> ...
>
> This shows that walsender restarts after the "INSERT INTO tab_3
> values(1)" is processed by the previous walsender ("released logical
> replication slot "tap_sub"" is after "INSERT INTO tab_3 values(1)").
> So, it is possible that the old apply worker has sent the confirmation
> of WAL received location after the Insert (due to keep_alive message
> handling). So, after the restart, the new walsender will start
> processing WAL after the INSERT and wait for the skipped message LOG
> timed out.
>
> Considering the above theory is correct, after "ALTER SUBSCRIPTION
> tap_sub SET PUBLICATION tap_pub_3", we should wait for the new
> walsender to restart. We are already doing the same for a similar case
> in 001_rep_changes.pl (See "# check that change of connection string
> and/or publication list causes restart of subscription workers. We
> check the state along with application_name to ensure that the
> walsender is (re)started.).
>
> Unfortunately, I will be away for the rest of the week. In the
> meantime, if you or someone else is able to reproduce and fix it, then
> good; otherwise, I'll take care of it after I return.
I agree with your analysis. I was able to reproduce the issue by
delaying the invalidation of the subscription until the walsender
finished decoding the INSERT operation following the ALTER
SUBSCRIPTION through a debugger and using the lsn from the pg_waldump
of the INSERT after the ALTER SUBSCRIPTION. In this scenario, the
confirmed_flush_lsn ends up pointing to a location after the INSERT.
When the invalidation is eventually received and the apply
worker/walsender is restarted, the restarted walsender begins decoding
from that LSN—after the INSERT—which means the "skipped loading
publication" warning is never triggered, causing the test to fail.
Attached is a patch that ensures the walsender process is properly
restarted after ALTER SUBSCRIPTION, preventing this race condition.
Regards,
Vignesh
Attachment | Content-Type | Size |
---|---|---|
0001-Fix-race-condition-after-ALTER-SUBSCRIPTION-SET-PUBL.patch | text/x-patch | 2.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Sami Imseih | 2025-04-30 16:38:08 | Re: Introduce some randomness to autovacuum |
Previous Message | Japin Li | 2025-04-30 15:48:53 | Fix outdated comments for IndexInfo |