Race-like failure in recovery/t/009_twophase.pl

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Race-like failure in recovery/t/009_twophase.pl
Date: 2017-07-02 03:21:21
Message-ID: 908.1498965681@sss.pgh.pa.us
Lists: pgsql-hackers

Since I put in that patch to reduce pg_ctl's reaction time, we've
seen several buildfarm failures with identical symptoms:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-06-30%2012%3A56%3A08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-06-30%2023%3A42%3A11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-01%2000%3A22%3A09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-07-01%2003%3A30%3A01

to wit that the recovery/009 test blows up with

Bailout called. Further testing stopped: pg_ctl start failed
FAILED--Further testing stopped: pg_ctl start failed

Fortunately, it reports the relevant log files, and we can see things
like this in the "slave" postmaster's log:

2017-07-01 02:31:51.131 UTC [11665568] LOG: entering standby mode
2017-07-01 02:31:51.198 UTC [11665568] LOG: consistent recovery state reached at 0/301CB10
2017-07-01 02:31:51.198 UTC [11665568] LOG: invalid record length at 0/301CB10: wanted 24, got 0
2017-07-01 02:31:51.200 UTC [12779594] LOG: database system is ready to accept read only connections
2017-07-01 02:31:51.214 UTC [5111896] LOG: fetching timeline history file for timeline 3 from primary server
2017-07-01 02:31:51.246 UTC [5111896] LOG: started streaming WAL from primary at 0/3000000 on timeline 2
2017-07-01 02:31:51.248 UTC [5111896] LOG: replication terminated by primary server
2017-07-01 02:31:51.248 UTC [5111896] DETAIL: End of WAL reached on timeline 2 at 0/301C168.
2017-07-01 02:31:51.263 UTC [11665568] LOG: new timeline 3 forked off current database system timeline 2 before current recovery point 0/301CB10
2017-07-01 02:31:51.263 UTC [5111896] LOG: restarted WAL streaming at 0/3000000 on timeline 2
2017-07-01 02:31:51.264 UTC [5111896] LOG: replication terminated by primary server
2017-07-01 02:31:51.264 UTC [5111896] DETAIL: End of WAL reached on timeline 2 at 0/301C168.
2017-07-01 02:31:51.264 UTC [11665568] LOG: new timeline 3 forked off current database system timeline 2 before current recovery point 0/301CB10
2017-07-01 02:31:51.579 UTC [12779594] LOG: received immediate shutdown request
2017-07-01 02:31:51.629 UTC [12779594] LOG: database system is shut down
2017-07-01 02:31:51.790 UTC [7471446] LOG: listening on Unix socket "/tmp/ELoHNJ0v5u/.s.PGSQL.49810"
2017-07-01 02:31:51.826 UTC [9502748] LOG: database system was interrupted while in recovery at log time 2017-07-01 02:31:49 UTC
2017-07-01 02:31:51.826 UTC [9502748] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2017-07-01 02:31:51.826 UTC [9502748] LOG: entering standby mode
2017-07-01 02:31:51.827 UTC [9502748] FATAL: requested timeline 3 is not a child of this server's history
2017-07-01 02:31:51.827 UTC [9502748] DETAIL: Latest checkpoint is at 0/301CAA0 on timeline 2, but in the history of the requested timeline, the server forked off from that timeline at 0/301C168.
2017-07-01 02:31:51.828 UTC [7471446] LOG: startup process (PID 9502748) exited with exit code 1
2017-07-01 02:31:51.829 UTC [7471446] LOG: aborting startup due to startup process failure

What seems to have happened here is that the slave has WAL up to
0/301CB10, but the master's WAL ends at 0/301C168, so we've got
trouble. How did the slave get ahead of the master?
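
For reference, the positions in question are the sort of thing you can
pull straight from the two nodes.  A rough sketch in the TAP-test idiom,
not part of the test script, and assuming the v10 function names:

# Diagnostic sketch only: where has each node gotten to?
my $master_write  = $node_master->safe_psql('postgres',
	'SELECT pg_current_wal_lsn()');
my $slave_receive = $node_slave->safe_psql('postgres',
	'SELECT pg_last_wal_receive_lsn()');
my $slave_replay  = $node_slave->safe_psql('postgres',
	'SELECT pg_last_wal_replay_lsn()');
note "master wrote to $master_write; "
  . "slave received $slave_receive, replayed $slave_replay";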

The reporting critters are all on the slow side, so I suspected
a timing problem, especially since it only started to show up
after changing pg_ctl's timing behavior. I can't reproduce it
locally on unmodified sources, but I could after putting my thumb
on the scales like this:

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ea9d21a..7ef22c1 100644
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
*************** WalReceiverMain(void)
*** 446,451 ****
--- 446,454 ----
* Something was received from master, so reset
* timeout
*/
+
+ pg_usleep(100000); /* be slow about processing */
+
last_recv_timestamp = GetCurrentTimestamp();
ping_sent = false;
XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);

(If this doesn't do it for you alone, you might try also increasing
WAITS_PER_SEC in pg_ctl.c, so as to make pg_ctl faster yet.)

I'm not entirely sure what's happening here, although the gist of it
has to be that we're killing off a slave server before it's quite
absorbed all the WAL the master has sent it, and then when we exchange
the roles of the master and slave, all of a sudden it's the slave that's
ahead, not the master.

Part of the reason I'm confused is that the programming technique
being used in 009_twophase.pl, namely doing

($node_master, $node_slave) = ($node_slave, $node_master);

and then working with the reversed variable names, is ENTIRELY TOO CUTE
FOR ITS OWN GOOD.  The objects continue to have their original "name"
fields and continue to refer to the same datadirs and log files
internally.  That means that when you're trying to trace through the
log files to understand what's happening, you have to remember how many
of these swaps have been done so far in order to tell whether script
operations on "$node_master" are actually affecting the "master" or the
"slave" installation, and whether a log entry saying that the "slave"
server was started corresponds to a script action on "$node_slave" or
"$node_master".  This might not be the worst, most confusing code I've
ever seen committed into Postgres, but it's right up there, and you can
bet that I'm going to be holding somebody's feet to the fire for a
rewrite.
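
A less confusing structure, sketched here with hypothetical node names
rather than as an actual patch, would give each installation a fixed
identity and track the current roles in separate variables, so that the
names appearing in the logs always match the underlying datadirs:

use strict;
use warnings;
use PostgresNode;
use Test::More;

# Hypothetical sketch: fixed identities for the two installations ...
my $node_a = get_new_node('node_a');
my $node_b = get_new_node('node_b');

# ... and separate variables for whichever node currently holds each
# role, swapped explicitly and loudly whenever the test flips them.
my ($cur_master, $cur_standby) = ($node_a, $node_b);

sub swap_roles
{
	($cur_master, $cur_standby) = ($cur_standby, $cur_master);
	note 'master role now belongs to ' . $cur_master->name;
}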

Anyway, having vented about that ... it's not very clear to me whether the
test script is at fault for not being careful to let the slave catch up to
the master before we promote it (and then deem the master to be usable as
a slave without rebuilding it first), or whether we actually imagine that
should work, in which case there's a replication logic bug here someplace.
The failure messages are also less clear than they could be, perhaps.
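
If the former is the answer, the fix would amount to making the script
wait for the standby to replay everything before each promotion and
role swap.  A rough sketch, again assuming the v10 function names and
relying on PostgresNode's poll_query_until:

# Sketch only: block until the standby has replayed up to the master's
# current write position before we promote it.
my $master_lsn = $node_master->safe_psql('postgres',
	'SELECT pg_current_wal_lsn()');
$node_slave->poll_query_until('postgres',
	"SELECT pg_last_wal_replay_lsn() >= '$master_lsn'::pg_lsn")
  or die "standby never caught up to $master_lsn";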

regards, tom lane
