Re: Build-farm - intermittent error in 031_column_list.pl

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Build-farm - intermittent error in 031_column_list.pl
Date: 2022-05-19 11:12:31
Message-ID: CAA4eK1LkvAHXWLtbuQ5JGhJZm_Rww_ukvoF6tqxBYEkRQ2DcVw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
> <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> >
> > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250(at)gmail(dot)com> wrote in
> > > Hi hackers.
> > >
> > > FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
> > > [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> > >
> > > The error happened during "subscriptionCheck" phase in the TAP test
> > > t/031_column_list.pl
> > > This test file was added by this [2] commit.
> > > [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
> >
> > What is happening for all of them looks like that the name of a
> > publication created by CREATE PUBLICATION without a failure report is
> > missing for a walsender came later. It seems like CREATE PUBLICATION
> > can silently fail to create a publication, or walsender somehow failed
> > to find existing one.
> >
>
> Do you see anything in LOGS which indicates CREATE SUBSCRIPTION has failed?
>
> >
> > > ~~
> > >
> >
> > 2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG: statement: CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
> > 2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG: disconnection: session time: 0:00:00.002 user=bf database=postgres host=[local]
> >
> > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
> > the session has been disconnected. But the following request for the
> > same publication fails due to the absense of the publication.
> >
> > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', publication_names '"pub9"')
> > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR: publication "pub9" does not exist
> >
>
> This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
> probable theory is that ALTER SUBSCRIPTION will lead to restarting of
> apply worker (which we can see in LOGS as well) and after the restart,
> the apply worker will use the existing slot and replication origin
> corresponding to the subscription. Now, it is possible that before
> restart the origin has not been updated and the WAL start location
> points to a location prior to where PUBLICATION pub9 exists which can
> lead to such an error. Once this error occurs, apply worker will never
> be able to proceed and will always return the same error. Does this
> make sense?
>
> Unless you or others see a different theory, this seems to be the
> existing problem in logical replication which is manifested by this
> test. If we just want to fix these test failures, we can create a new
> subscription instead of altering the existing publication to point to
> the new publication.
>

If the above theory is correct then I think allowing the publisher to
catch up with "$node_publisher->wait_for_catchup('sub1');" before
ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
both publisher and subscriber are in sync then the new publication
should be visible to WALSender.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2022-05-19 11:28:53 Re: Addition of PostgreSQL::Test::Cluster::pg_version()
Previous Message Andrey Lepikhov 2022-05-19 10:48:18 Re: Removing unneeded self joins