Re: Build-farm - intermittent error in 031_column_list.pl

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Build-farm - intermittent error in 031_column_list.pl
Date: 2022-05-19 09:46:52
Message-ID: CAA4eK1Lc=NDV1HrY2gNasFK90MtysnA575a+rd0p+POjXN+Spw@mail.gmail.com
Lists: pgsql-hackers

On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250(at)gmail(dot)com> wrote in
> > Hi hackers.
> >
> > FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
> > [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> >
> > The error happened during "subscriptionCheck" phase in the TAP test
> > t/031_column_list.pl
> > This test file was added by this [2] commit.
> > [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
>
> In all of these failures, it looks like a publication that CREATE
> PUBLICATION created without reporting any failure is not visible to
> a walsender that starts later. It seems that either CREATE
> PUBLICATION can silently fail to create a publication, or the
> walsender somehow fails to find the existing one.
>

Do you see anything in the logs which indicates CREATE PUBLICATION has failed?

>
> > ~~
> >
>
> 2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG: statement: CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
> 2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG: disconnection: session time: 0:00:00.002 user=bf database=postgres host=[local]
>
> "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
> the session has been disconnected. But the following request for the
> same publication fails due to the absence of the publication.
>
> 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', publication_names '"pub9"')
> 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR: publication "pub9" does not exist
>

This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". My
theory is that ALTER SUBSCRIPTION restarts the apply worker (which we
can see in the logs as well), and after the restart, the apply worker
uses the existing slot and replication origin corresponding to the
subscription. Now, it is possible that the origin had not been updated
before the restart, so the WAL start location points to a location
prior to where publication pub9 exists, which can lead to such an
error. Once this error occurs, the apply worker will never be able to
proceed and will always return the same error. Does this make sense?
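
One way to check this theory would be to compare the positions involved
when the failure reproduces. The queries below are only a sketch, not
something taken from the buildfarm run: they read the standard views
pg_replication_slots and pg_replication_origin_status, and assume the
slot is named sub1 as in the log above.

```sql
-- On the publisher: where the existing slot would let decoding restart,
-- compared with the current end of WAL.
SELECT slot_name, restart_lsn, confirmed_flush_lsn, pg_current_wal_lsn()
  FROM pg_replication_slots
 WHERE slot_name = 'sub1';

-- On the subscriber: the remote position recorded for each replication
-- origin. The restarted apply worker asks to resume from this LSN, so a
-- value older than the WAL location of CREATE PUBLICATION pub9 would be
-- consistent with the theory.
SELECT external_id, remote_lsn, local_lsn
  FROM pg_replication_origin_status;
```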

Unless you or others see a different theory, this seems to be an
existing problem in logical replication which is exposed by this
test. If we just want to fix these test failures, we can create a new
subscription instead of altering the existing subscription to point to
the new publication.
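
A rough sketch of what that test change could look like. This is an
assumption about the shape of the change, not a tested patch; the
connection string is a placeholder, and the test might also need to
account for the initial table copy that a new subscription performs by
default.

```sql
-- Pattern that can hit the error: the restarted apply worker reuses the
-- existing slot and replication origin of sub1.
--   ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9;

-- Possible alternative: recreate the subscription, so that a new slot is
-- created on the publisher after pub9 already exists.
DROP SUBSCRIPTION sub1;
CREATE SUBSCRIPTION sub1
    CONNECTION 'publisher_connstr'   -- placeholder, not the real connstr
    PUBLICATION pub9;
```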

Note: I have added Tomas to get his views, as he committed this test.

--
With Regards,
Amit Kapila.
