pg_rewind test race condition..?

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: pg_rewind test race condition..?
Date: 2015-04-28 18:02:54
Message-ID: 20150428180253.GU30322@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Heikki,

Not sure if anyone else is seeing this, but I'm getting regression
test failures when running the pg_rewind tests pretty consistently
with 'make check'. Specifically with "basic remote", I'm getting:

source and target cluster are on the same timeline
Failure, exiting

in regress_log/pg_rewind_log_basic_remote.

If I throw a "sleep(5);" into t/001_basic.pl before the call to
RewindTest::run_pg_rewind($test_mode); then everything works fine.

./configure --silent --prefix="$INSTALL" --with-openssl --with-tcl
--with-tclconfig=/usr/lib/tcl8.6 --with-perl --enable-debug
--enable-cassert --enable-tap-tests --with-gssapi

Full unsuccessful pg_rewind_log_basic_remote output:
-----------------------------------------------------------------
waiting for server to start....LOG: database system was shut down at 2015-04-28 13:46:34 EDT
LOG: database system is ready to accept connections
done
server started
waiting for server to start....LOG: database system was interrupted; last known up at 2015-04-28 13:46:35 EDT
LOG: entering standby mode
LOG: redo starts at 0/2000028
LOG: consistent recovery state reached at 0/20000F8
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/3000000 on timeline 1
done
server started
server promoting
LOG: received promote request
FATAL: terminating walreceiver process due to administrator command
LOG: invalid record length at 0/30EB410
LOG: redo done at 0/30EB3A0
LOG: last completed transaction was at log time 2015-04-28 13:46:37.540952-04
LOG: selected new timeline ID: 2
LOG: archive recovery complete
LOG: database system is ready to accept connections
waiting for server to shut down....LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: shutting down
LOG: database system is shut down
done
server stopped

source and target cluster are on the same timeline
Failure, exiting
waiting for server to start....LOG: database system was shut down at 2015-04-28 13:46:39 EDT
LOG: entering standby mode
LOG: consistent recovery state reached at 0/3311438
LOG: invalid record length at 0/3311438
LOG: database system is ready to accept read only connections
LOG: fetching timeline history file for timeline 2 from primary server
LOG: started streaming WAL from primary at 0/3000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/30EB410.
LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/3311438
LOG: restarted WAL streaming at 0/3000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/30EB410.
LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/3311438
done
server started
LOG: received immediate shutdown request
LOG: received immediate shutdown request
-----------------------------------------------------------------

Full successful pg_rewind_log_basic_remote output:

-----------------------------------------------------------------
waiting for server to start....LOG: database system was shut down at 2015-04-28 13:54:04 EDT
LOG: database system is ready to accept connections
done
server started
waiting for server to start....LOG: database system was interrupted; last known up at 2015-04-28 13:54:05 EDT
LOG: entering standby mode
LOG: redo starts at 0/2000028
LOG: consistent recovery state reached at 0/20000F8
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/3000000 on timeline 1
done
server started
server promoting
LOG: received promote request
FATAL: terminating walreceiver process due to administrator command
LOG: invalid record length at 0/30EB410
LOG: redo done at 0/30EB3A0
LOG: last completed transaction was at log time 2015-04-28 13:54:07.547533-04
LOG: selected new timeline ID: 2
LOG: archive recovery complete
LOG: database system is ready to accept connections
waiting for server to shut down....LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: shutting down
LOG: database system is shut down
done
server stopped
The servers diverged at WAL position 0/30EB410 on timeline 1.
Rewinding from last common checkpoint at 0/30EB3A0 on timeline 1
Done!
waiting for server to start....LOG: database system was interrupted while in recovery at log time 2015-04-28 13:54:08 EDT
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: redo starts at 0/30EB368
LOG: consistent recovery state reached at 0/3113788
LOG: invalid record length at 0/3113788
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/3000000 on timeline 2
done
server started
LOG: received immediate shutdown request
LOG: unexpected EOF on standby connection
LOG: received immediate shutdown request
-----------------------------------------------------------------

Thanks!

Stephen

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-04-28 18:05:08 Re: Freeze avoidance of very large table.
Previous Message Pavel Stehule 2015-04-28 17:58:15 Re: cache invalidation for PL/pgsql functions