RE: Build-farm - intermittent error in 031_column_list.pl

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Build-farm - intermittent error in 031_column_list.pl
Date: 2022-05-20 14:53:22
Message-ID: TYCPR01MB83737A68CD5D554EA82BD7B9EDD39@TYCPR01MB8373.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thursday, May 19, 2022 8:13 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >
> > On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
> > <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > >
> > > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith
> > > <smithpb2250(at)gmail(dot)com> wrote in
> > > > Hi hackers.
> > > >
> > > > FYI, I saw that there was a recent Build-farm error on the
> > > > "grison" machine [1] [1]
> > > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison
> > > > &br=HEAD
> > > >
> > > > The error happened during "subscriptionCheck" phase in the TAP
> > > > test t/031_column_list.pl This test file was added by this [2]
> > > > commit.
> > > > [2]
> > > >
> https://github.com/postgres/postgres/commit/923def9a533a7d986acfb5
> > > > 24139d8b9e5466d0a5
> > >
> > > 2022-04-17 00:16:04.278 CEST [293659][client
> > > backend][4/270:0][031_column_list.pl] LOG: statement: CREATE
> > > PUBLICATION pub9 FOR TABLE test_part_d (a) WITH
> > > (publish_via_partition_root = true);
> > > 2022-04-17 00:16:04.279 CEST [293659][client
> > > backend][:0][031_column_list.pl] LOG: disconnection: session time:
> > > 0:00:00.002 user=bf database=postgres host=[local]
> > >
> > > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
> > > the session has been disconnected. But the following request for the
> > > same publication fails due to the absense of the publication.
> > >
> > > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1]
> > > STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88
> > > (proto_version '3', publication_names '"pub9"')
> > > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:
> > > publication "pub9" does not exist
> > >
> >
> > This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9".
> The
> > probable theory is that ALTER SUBSCRIPTION will lead to restarting of
> > apply worker (which we can see in LOGS as well) and after the restart,
> > the apply worker will use the existing slot and replication origin
> > corresponding to the subscription. Now, it is possible that before
> > restart the origin has not been updated and the WAL start location
> > points to a location prior to where PUBLICATION pub9 exists which can
> > lead to such an error. Once this error occurs, apply worker will never
> > be able to proceed and will always return the same error. Does this
> > make sense?
> >
> > Unless you or others see a different theory, this seems to be the
> > existing problem in logical replication which is manifested by this
> > test. If we just want to fix these test failures, we can create a new
> > subscription instead of altering the existing publication to point to
> > the new publication.
> >
>
> If the above theory is correct then I think allowing the publisher to catch up with
> "$node_publisher->wait_for_catchup('sub1');" before ALTER SUBSCRIPTION
> should fix this problem. Because if before ALTER both publisher and
> subscriber are in sync then the new publication should be visible to
> WALSender.
Hi,

I've attached a patch for the fix proposed here.
First of all, thank you so much for helping me offlist, Amit-san.

I reproduced the failure like [1] by commenting out
WalSndWaitForWal's call of WalSndKeepalive and running the test.
This comment out intends to suppress the advance of confirmed_flush location
after creating a publication.

In short, my understanding how the bug happened is,
1. we execute 'create publication pubX' and create one publication.
2. 'alter subscription subY set publication pubX' makes the apply worker exit
3. relaunched apply worker searches for pubX. But, the slot position(confirmed_flush)
doesn't get updated and points to some location before create publication at the publisher node.

Applying the attached patch have made the test pass.

[1] the subscriber's log

2022-05-20 08:56:50.773 UTC [5153] 031_column_list.pl LOG: statement: ALTER SUBSCRIPTION sub1 SET PUBLICATION pub6
2022-05-20 08:56:50.801 UTC [5156] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-05-20 08:56:50.846 UTC [5112] LOG: logical replication apply worker for subscription "sub1" will restart because of a parameter change
2022-05-20 08:56:50.915 UTC [5158] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
...
2022-05-20 08:56:51.257 UTC [5164] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-05-20 08:56:51.353 UTC [5166] LOG: logical replication apply worker for subscription "sub1" has started
2022-05-20 08:56:51.366 UTC [5168] LOG: logical replication table synchronization worker for subscription "sub1", table "test_part_a" has started
2022-05-20 08:56:51.370 UTC [5171] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-05-20 08:56:51.373 UTC [5166] ERROR: could not receive data from WAL stream: ERROR: publication "pub6" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/15C61B8
2022-05-20 08:56:51.374 UTC [4338] LOG: background worker "logical replication worker" (PID 5166) exited with exit code 1

Best Regards,
Takamichi Osumi

Attachment Content-Type Size
v1-0001-Fix-the-intermittent-Build-farm-failures-in-031_c.patch application/octet-stream 4.1 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-05-20 15:34:55 Re: 15beta1 test failure on mips in isolation/expected/stats
Previous Message Maxim Orlov 2022-05-20 14:38:03 Re: Add 64-bit XIDs into PostgreSQL 15