Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS
Date: 2020-09-14 15:20:03
Message-ID: 1032727.1600096803@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

In connection with a nearby thread, I tried to run the subscription
test suite in a CLOBBER_CACHE_ALWAYS build. I soon found that I had
to increase wal_receiver_timeout, but after doing this:

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bff..5fe6810 100644
*** a/src/test/perl/PostgresNode.pm
--- b/src/test/perl/PostgresNode.pm
*************** sub init
*** 447,452 ****
--- 447,453 ----
print $conf "log_statement = all\n";
print $conf "log_replication_commands = on\n";
print $conf "wal_retrieve_retry_interval = '500ms'\n";
+ print $conf "wal_receiver_timeout = '10min'\n";

# If a setting tends to affect whether tests pass or fail, print it after
# TEMP_CONFIG. Otherwise, print it before TEMP_CONFIG, thereby permitting

I let it run overnight, and came back to find that it was stuck at

[03:02:15] t/013_partition.pl ................. 19/51

and had been for circa eight hours, where extrapolation from other tests
said it shouldn't take much over half an hour. Investigation found that
the subscriber was repeatedly failing like this:

2020-09-14 11:05:26.483 EDT [1030506] LOG: logical replication apply worker for subscription "sub1" has started
2020-09-14 11:05:27.139 EDT [1030506] ERROR: cache lookup failed for relation 0
2020-09-14 11:05:27.140 EDT [947156] LOG: background worker "logical replication worker" (PID 1030506) exited with exit code 1
2020-09-14 11:05:27.571 EDT [1030509] LOG: logical replication apply worker for subscription "sub1" has started
2020-09-14 11:05:28.227 EDT [1030509] ERROR: cache lookup failed for relation 0
2020-09-14 11:05:28.228 EDT [947156] LOG: background worker "logical replication worker" (PID 1030509) exited with exit code 1

The publisher's log shows no sign of distress:

2020-09-14 11:06:09.380 EDT [1030619] sub1 LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG: received replication command: IDENTIFY_SYSTEM
2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG: received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', publication_names '"pub1"')
2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG: starting logical decoding for slot "sub1"
2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL: Streaming transactions committing after 0/163D848, reading WAL from 0/163CF08.
2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG: logical decoding found consistent point at 0/163CF08
2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL: There are no running transactions.
2020-09-14 11:06:10.468 EDT [1030621] sub1 LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-14 11:06:10.533 EDT [1030621] sub1 LOG: received replication command: IDENTIFY_SYSTEM
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG: received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', publication_names '"pub1"')
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG: starting logical decoding for slot "sub1"
2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL: Streaming transactions committing after 0/163D848, reading WAL from 0/163CF08.
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG: logical decoding found consistent point at 0/163CF08
2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL: There are no running transactions.

I do not have time today to chase this further, but somebody should.

More generally, this seems like good evidence that we really oughta have a
buildfarm member that's running *all* the tests under CLOBBER_CACHE_ALWAYS
not just the core tests. That seems impossibly expensive, but I realized
while watching these tests that a ridiculous fraction of the runtime is
being spent in repeated initdb calls. On my machine, initdb takes about
12 minutes under CCA, so doing it separately for publisher and subscriber
means 24 minutes, which compares not at all favorably to the
circa-half-an-hour total runtime of each of the subscription test scripts.
We're surely not learning anything after the first CCA run of initdb, so
if we could find a way to skip that overhead for later runs, it'd make a
huge difference in the practicality of running these tests under CCA.

I recall having worked on a patch to make the regression tests run
initdb just once, creating a template directory tree, and then "cp -a"
that into place for each test. I did not finish it, because it wasn't
showing a lot of advantage in a normal test run, but maybe the idea
could be resurrected for CCA and other slow builds.

Another idea is to make CCA a little more dynamic, say allow it to be
suppressed through an environment variable setting, and then use that
to speed up per-test initdbs.

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-09-14 15:24:51 Re: pg_restore causing deadlocks on partitioned tables
Previous Message Daniel Gustafsson 2020-09-14 15:00:19 Re: pg_dump --where option