Actual RC of "restore_command" is relevant for DB startup

From: "Gunnar \"Nick\" Bluth" <gunnar(dot)bluth(dot)extern(at)elster(dot)de>
To: pgsql-docs(at)postgresql(dot)org
Cc: Gunnar Nick Bluth <gunnar(dot)bluth(at)pro-open(dot)de>
Subject: Actual RC of "restore_command" is relevant for DB startup
Date: 2016-04-20 12:55:34
Message-ID: 57177C46.6040604@elster.de
Lists: pgsql-docs

Hello,

I've just stumbled across a certain oddity with "restore_command" while
setting up a fresh environment with segmented (i.e., firewalled) networks.

I configured the restore_command as found in the PGBARMan docs (using
ssh) and was a bit stunned that after a restart, I saw this in the logs:

2016-04-20 13:22:45 CEST [3788]: [2-1] db=,user= FATAL: could not
restore file "00000002.history" from archive: child process exited with
exit code 255
2016-04-20 13:22:45 CEST [3786]: [3-1] db=,user= LOG: startup process
(PID 3788) exited with exit code 1
2016-04-20 13:22:45 CEST [3786]: [4-1] db=,user= LOG: aborting startup
due to startup process failure

Which was obviously caused by
ssh: connect to host <archive server> port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.0]

Now, the firewall does not let ssh through (yet), so the root cause is
quite obvious.

However, the docs[1] only state that:
"(...) if the command was terminated by a signal (other than SIGTERM,
which is used as part of a database server shutdown) or an error by the
shell (such as command not found), then recovery will abort and the
server will not start up."

In [2], Kevin Grittner stated that it might be that the command's RC
should be <= 255, otherwise it will be assessed as "failed badly; give up".

And indeed, after amending the restore_command with a "|| exit 1", the
server starts up just fine, using replication to fetch the missing WALs.
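
For illustration, here is a minimal sketch of the same idea written as a
wrapper script instead of an inline "|| exit 1". It is not the command from
the docs mentioned above: the script name, ARCHIVE_HOST and ARCHIVE_DIR are
hypothetical placeholders, and fetching the file with rsync over ssh is an
assumption.

```shell
#!/bin/sh
# restore_wal.sh (hypothetical name), intended to be referenced as
#   restore_command = '/path/to/restore_wal.sh %f %p'
# ARCHIVE_HOST and ARCHIVE_DIR are placeholders, not taken from a real setup.

ARCHIVE_HOST="${ARCHIVE_HOST:-archive.example.com}"
ARCHIVE_DIR="${ARCHIVE_DIR:-/path/to/wal_archive}"

# Run a command and collapse every nonzero exit status (for example the
# 255 of a failed ssh connection) into a plain 1; success stays 0.
clamp_status() {
    "$@" || return 1
}

# Fetch WAL file $1 from the archive host into the local path $2.
restore_wal() {
    clamp_status rsync -a "${ARCHIVE_HOST}:${ARCHIVE_DIR}/$1" "$2"
}

# Only act when called with the two arguments PostgreSQL passes (%f %p).
if [ "$#" -eq 2 ]; then
    restore_wal "$1" "$2"
    exit $?
fi
```

Note that collapsing everything to 1 also hides genuine shell errors (such
as command not found) and signals from the server, so this is a workaround
rather than a fix.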

That is fine for me as a workaround right now. However, had I run into
this not while setting everything up from scratch but during a disaster
(or simply during a downtime or very high load of the archive server
while restarting a slave), this (basically undocumented!) behavior would
have caused me quite a headache...!

I reckon few users will expect a connection timeout to fall into
the category of "command not found"...

Maybe the part "error by the shell (such as command not found)" could be
changed to "error by the shell (RC > 254, e.g. command not found or ssh
connection failure)" (actually, whatever the real behaviour is, I didn't
check the sources...)?

[1] http://www.postgresql.org/docs/current/static/archive-recovery-settings.html
[2] http://stackoverflow.com/questions/10524458/postgresql-9-1-streaming-replication-restore-command-special-meaning-of-exit-co

Best regards,
--
Gunnar "Nick" Bluth
DBA ELSTER

Tel: +49 911/991-4665
Mobil: +49 172/8853339
