pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Date: 2015-09-03 06:25:00
Message-ID: 20150903062500.GB2973274@tornado.leadboat.com
Lists: pgsql-hackers

My AIX buildfarm members have failed the BinInstallCheck step on and off since
inception. It became more frequent when I added animals sungazer and tern
alongside the older hornet and mandrill. The animals share a machine with
each other and with dozens of other developers. I setpriority() the animals
to the lowest available priority, so they probably lose the CPU for long
periods. Separately, this machine has slow filesystem metadata operations.
For example, git-new-workdir takes ~50s for a PostgreSQL tree.

The pg_rewind suite has failed a few times when crash recovery took longer
than the 60s pg_ctl default timeout. Disabling fsync (commit 7d7a103) reduced
median crash recovery time by 75%, which may suffice. If not, I'll be
inclined to add --timeout=900 to each pg_ctl invocation.
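
As a rough illustration of why a fixed wait budget breaks on a loaded machine, here is a minimal C sketch of a once-per-second polling loop with a timeout, similar in spirit to waiting with pg_ctl -w. The names wait_for_ready, probe_fn, fake_server and fake_probe are hypothetical and are not taken from pg_ctl.c; the sketch counts polls instead of sleeping so that it runs instantly.

```c
#include <stdbool.h>

/* Hypothetical probe: returns true once the server accepts connections. */
typedef bool (*probe_fn) (void *state);

/*
 * Poll once per simulated second until the probe succeeds or the budget
 * is used up.  A real loop would sleep between polls; this sketch only
 * counts them.
 */
bool
wait_for_ready(probe_fn probe, void *state, int timeout_seconds)
{
    for (int waited = 0; waited < timeout_seconds; waited++)
    {
        if (probe(state))
            return true;
    }
    return false;               /* budget exhausted before the server came up */
}

/* Simulated server that becomes ready after a given number of polls. */
struct fake_server
{
    int         polls_until_ready;
};

bool
fake_probe(void *state)
{
    struct fake_server *srv = state;

    if (srv->polls_until_ready > 0)
    {
        srv->polls_until_ready--;
        return false;
    }
    return true;
}
```

With a simulated recovery that needs 100 polls, a budget of 60 gives up while a budget of 900 succeeds, which is the effect --timeout=900 would have for the slow animals.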

The pg_ctl suite has failed with "not ok 12 - second pg_ctl start succeeds".
You can reproduce that by adding "sleep 3;" between that test and the one
before it. The timing dependency comes from the pg_ctl "slop" time:

/*
 * Make sanity checks. If it's for a standalone backend
 * (negative PID), or the recorded start time is before
 * pg_ctl started, then either we are looking at the wrong
 * data directory, or this is a pre-existing pidfile that
 * hasn't (yet?) been overwritten by our child postmaster.
 * Allow 2 seconds slop for possible cross-process clock
 * skew.
 */
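
To make the timing dependency concrete, the check that comment describes can be sketched roughly as follows. This is a simplified stand-alone approximation, not the code from pg_ctl.c; the function name pidfile_looks_current, its parameters, and the START_TIME_SLOP macro are hypothetical.

```c
#include <stdbool.h>
#include <time.h>

/* Seconds of slop allowed for cross-process clock skew, per the comment. */
#define START_TIME_SLOP 2

/*
 * Return true if the pidfile plausibly belongs to a postmaster started by
 * this pg_ctl run: the PID is positive (not a standalone backend) and the
 * recorded start time is no more than START_TIME_SLOP seconds before
 * pg_ctl itself started.
 */
bool
pidfile_looks_current(long pidfile_pid, time_t pidfile_start,
                      time_t pg_ctl_start)
{
    if (pidfile_pid <= 0)
        return false;           /* standalone backend or garbage */
    if (pidfile_start < pg_ctl_start - START_TIME_SLOP)
        return false;           /* pre-existing pidfile, not our child's */
    return true;
}
```

Under these assumptions, if the first postmaster recorded start time 100 and the second pg_ctl starts at 101, the old pidfile falls within the slop and is accepted as if it belonged to the new child. If the second pg_ctl starts at 104, the same pidfile is ignored.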

The "second pg_ctl start succeeds" tested-for behavior is actually a minor bug
that we'd ideally fix as described in the last paragraph of the commit 3c485ca
log message:

    All of this could be improved if we rewrote start_postmaster() so that it
    could report the child postmaster's PID, so that we'd know a-priori the
    correct PID to test with postmaster_is_alive().  That looks like a bit too
    much change for so late in the 9.1 development cycle, unfortunately.

I recommend we invert the test expectation and, pending the ideal pg_ctl fix,
add the "sleep 3" to avoid falling within the time slop:

--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -35,6 +35,7 @@ close CONF;
 command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
 	'pg_ctl start -w');
-command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
-	'second pg_ctl start succeeds');
+sleep 3;					# bridge test_postmaster_connection() slop threshold
+command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
+	'second pg_ctl start fails');
 command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
 	'pg_ctl stop -w');

Alternatively, I could just remove the test.

crake failed the same way, once:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06
