Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

From: Neha Sharma <neha(dot)sharma(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS
Date: 2020-09-14 20:15:52
Message-ID: CANiYTQvsak+efQtnuiBNwKwmnbtCur84-jHrf1mErH7LrW66JQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tom,

I have tested the subscription test 013_partition.pl with CCA enabled on
HEAD and PG13,
and I am able to reproduce the issue on both the versions.

*Logs:*
[centos(at)clobber-cache subscription]$ git branch
* REL_13_STABLE
master
[centos(at)clobber-cache-db93 subscription]$ tail -f
tmp_check/log/013_partition_subscriber1.log
2020-09-15 08:42:19.763 UTC [27866] LOG: logical replication apply worker
for subscription "sub1" has started
2020-09-15 08:42:20.395 UTC [27866] ERROR: cache lookup failed for
relation 0
2020-09-15 08:42:20.436 UTC [26427] LOG: background worker "logical
replication worker" (PID 27866) exited with exit code 1
2020-09-15 08:42:20.835 UTC [27868] LOG: logical replication apply worker
for subscription "sub1" has started
2020-09-15 08:42:21.462 UTC [27868] ERROR: cache lookup failed for
relation 0
2020-09-15 08:42:21.508 UTC [26427] LOG: background worker "logical
replication worker" (PID 27868) exited with exit code 1
2020-09-15 08:42:21.921 UTC [27870] LOG: logical replication apply worker
for subscription "sub1" has started
2020-09-15 08:42:22.551 UTC [27870] ERROR: cache lookup failed for
relation 0

Thanks.
--
Regards,
Neha Sharma

On Mon, Sep 14, 2020 at 8:50 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> In connection with a nearby thread, I tried to run the subscription
> test suite in a CLOBBER_CACHE_ALWAYS build. I soon found that I had
> to increase wal_receiver_timeout, but after doing this:
>
> diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
> index 1488bff..5fe6810 100644
> *** a/src/test/perl/PostgresNode.pm
> --- b/src/test/perl/PostgresNode.pm
> *************** sub init
> *** 447,452 ****
> --- 447,453 ----
> print $conf "log_statement = all\n";
> print $conf "log_replication_commands = on\n";
> print $conf "wal_retrieve_retry_interval = '500ms'\n";
> + print $conf "wal_receiver_timeout = '10min'\n";
>
> # If a setting tends to affect whether tests pass or fail, print it
> after
> # TEMP_CONFIG. Otherwise, print it before TEMP_CONFIG, thereby
> permitting
>
> I let it run overnight, and came back to find that it was stuck at
>
> [03:02:15] t/013_partition.pl ................. 19/51
>
> and had been for circa eight hours, where extrapolation from other tests
> said it shouldn't take much over half an hour. Investigation found that
> the subscriber was repeatedly failing like this:
>
> 2020-09-14 11:05:26.483 EDT [1030506] LOG: logical replication apply
> worker for subscription "sub1" has started
> 2020-09-14 11:05:27.139 EDT [1030506] ERROR: cache lookup failed for
> relation 0
> 2020-09-14 11:05:27.140 EDT [947156] LOG: background worker "logical
> replication worker" (PID 1030506) exited with exit code 1
> 2020-09-14 11:05:27.571 EDT [1030509] LOG: logical replication apply
> worker for subscription "sub1" has started
> 2020-09-14 11:05:28.227 EDT [1030509] ERROR: cache lookup failed for
> relation 0
> 2020-09-14 11:05:28.228 EDT [947156] LOG: background worker "logical
> replication worker" (PID 1030509) exited with exit code 1
>
> The publisher's log shows no sign of distress:
>
> 2020-09-14 11:06:09.380 EDT [1030619] sub1 LOG: statement: SELECT
> pg_catalog.set_config('search_path', '', false);
> 2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG: received replication
> command: IDENTIFY_SYSTEM
> 2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG: received replication
> command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version
> '2', publication_names '"pub1"')
> 2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG: starting logical decoding
> for slot "sub1"
> 2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL: Streaming transactions
> committing after 0/163D848, reading WAL from 0/163CF08.
> 2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG: logical decoding found
> consistent point at 0/163CF08
> 2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL: There are no running
> transactions.
> 2020-09-14 11:06:10.468 EDT [1030621] sub1 LOG: statement: SELECT
> pg_catalog.set_config('search_path', '', false);
> 2020-09-14 11:06:10.533 EDT [1030621] sub1 LOG: received replication
> command: IDENTIFY_SYSTEM
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG: received replication
> command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version
> '2', publication_names '"pub1"')
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG: starting logical decoding
> for slot "sub1"
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL: Streaming transactions
> committing after 0/163D848, reading WAL from 0/163CF08.
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG: logical decoding found
> consistent point at 0/163CF08
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL: There are no running
> transactions.
>
> I do not have time today to chase this further, but somebody should.
>
> More generally, this seems like good evidence that we really oughta have a
> buildfarm member that's running *all* the tests under CLOBBER_CACHE_ALWAYS
> not just the core tests. That seems impossibly expensive, but I realized
> while watching these tests that a ridiculous fraction of the runtime is
> being spent in repeated initdb calls. On my machine, initdb takes about
> 12 minutes under CCA, so doing it separately for publisher and subscriber
> means 24 minutes, which compares not at all favorably to the
> circa-half-an-hour total runtime of each of the subscription test scripts.
> We're surely not learning anything after the first CCA run of initdb, so
> if we could find a way to skip that overhead for later runs, it'd make a
> huge difference in the practicality of running these tests under CCA.
>
> I recall having worked on a patch to make the regression tests run
> initdb just once, creating a template directory tree, and then "cp -a"
> that into place for each test. I did not finish it, because it wasn't
> showing a lot of advantage in a normal test run, but maybe the idea
> could be resurrected for CCA and other slow builds.
>
> Another idea is to make CCA a little more dynamic, say allow it to be
> suppressed through an environment variable setting, and then use that
> to speed up per-test initdbs.
>
> regards, tom lane
>
>
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-09-14 20:50:25 "Unified logging system" breaks access to pg_dump debug outputs
Previous Message Andres Freund 2020-09-14 20:13:35 Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING