Error restoring from a base backup taken from standby

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Error restoring from a base backup taken from standby
Date: 2012-12-17 17:39:29
Message-ID: 50CF58D1.2060903@vmware.com
Lists: pgsql-hackers

(This is different from the other issue related to timeline switches I
just posted about. There's no timeline switch involved in this one.)

If you do "pg_basebackup -x" against a standby server, in some
circumstances the backup fails to restore with an error like this:

C 2012-12-17 19:09:44.042 EET 7832 LOG: database system was not properly shut down; automatic recovery in progress
C 2012-12-17 19:09:44.091 EET 7832 LOG: record with zero length at 0/1764F48
C 2012-12-17 19:09:44.091 EET 7832 LOG: redo is not required
C 2012-12-17 19:09:44.091 EET 7832 FATAL: WAL ends before end of online backup
C 2012-12-17 19:09:44.091 EET 7832 HINT: All WAL generated while online backup was taken must be available at recovery.
C 2012-12-17 19:09:44.092 EET 7831 LOG: startup process (PID 7832) exited with exit code 1
C 2012-12-17 19:09:44.092 EET 7831 LOG: aborting startup due to startup process failure

I spotted this bug while reading the code, and it took me quite a while
to actually construct a test case to reproduce it, so let me begin by
discussing the code where the bug is. You get the above error, "WAL
ends before end of online backup", when you reach the end of WAL before
reaching the backupEndPoint stored in the control file, which originally
comes from the backup_label file. backupEndPoint is only used in a base
backup taken from a standby; in a base backup taken from the master, the
end-of-backup WAL record is used instead to mark the end of backup.

In the xlog redo loop, after replaying each record, we check if we've
just reached backupEndPoint, and clear it from the control file if we
have. Now the problem is, if there are no WAL records after the
checkpoint redo point, we never even enter the redo loop, so
backupEndPoint is not cleared even though it's reached immediately after
reading the initial checkpoint record.
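
To make the failure mode concrete, here is a small self-contained toy
model of the control flow described above. It is only a sketch, not the
actual xlog.c code: ToyLsn, ToyControlFile and toy_recover_unpatched()
are made-up names, WAL positions are plain integers, 0 stands for "no
backup end point pending", and the ">=" comparison is a simplification
chosen for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins; none of these are the real PostgreSQL definitions. */
typedef uint64_t ToyLsn;

typedef struct
{
    ToyLsn backup_end_point;    /* 0 means no backup end point is pending */
} ToyControlFile;

/*
 * Models recovery where the backup end point is checked only inside the
 * redo loop, after each replayed record.  rec_end[] holds the end
 * position of each WAL record that follows the checkpoint.
 *
 * Returns true if recovery may finish, false for the
 * "WAL ends before end of online backup" case.
 */
bool
toy_recover_unpatched(ToyControlFile *cf, ToyLsn checkpoint_end,
                      const ToyLsn *rec_end, size_t nrecs)
{
    /* The position right after the checkpoint record is never compared. */
    (void) checkpoint_end;

    for (size_t i = 0; i < nrecs; i++)
    {
        /* ... replay record i here ... */
        if (cf->backup_end_point != 0 && rec_end[i] >= cf->backup_end_point)
            cf->backup_end_point = 0;   /* end of backup reached */
    }

    /* With nrecs == 0 the loop body never ran, so nothing was cleared. */
    return cf->backup_end_point == 0;
}
```

In this model, calling toy_recover_unpatched() with zero records and a
backup end point equal to checkpoint_end returns false, which corresponds
to the FATAL error in the log above.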

To deal with the similar situation wrt. reaching consistency for hot
standby purposes, we call CheckRecoveryConsistency() before the redo
loop. The straightforward fix is to copy-paste the check for
backupEndPoint to just before the redo loop, next to the
CheckRecoveryConsistency() call. Even better, I think we should move the
backupEndPoint check into CheckRecoveryConsistency(). It's already
responsible for keeping track of whether minRecoveryPoint has been
reached, so it seems like a good idea to do this check there as well.
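
The shape of that second option can be sketched with a toy model. Again,
this is an illustration under stated assumptions, not the attached patch:
SketchLsn, SketchRecoveryState, sketch_check_consistency() and
sketch_recover_patched() are hypothetical names, positions are plain
integers, and the rule that consistency requires the backup end point to
be cleared first is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins; none of these are the real PostgreSQL definitions. */
typedef uint64_t SketchLsn;

typedef struct
{
    SketchLsn backup_end_point;     /* 0 means no backup end point pending */
    SketchLsn min_recovery_point;
    bool      reached_consistency;
} SketchRecoveryState;

/*
 * Hypothetical analogue of CheckRecoveryConsistency(): one place that
 * handles both the backup end point and the minimum recovery point.
 */
void
sketch_check_consistency(SketchRecoveryState *st, SketchLsn replayed_upto)
{
    if (st->backup_end_point != 0 && replayed_upto >= st->backup_end_point)
        st->backup_end_point = 0;   /* end of backup reached */

    /* Assumption: consistency needs the backup end point cleared first. */
    if (!st->reached_consistency &&
        st->backup_end_point == 0 &&
        replayed_upto >= st->min_recovery_point)
        st->reached_consistency = true;
}

/*
 * Recovery with the combined check called once before the redo loop and
 * again after every replayed record.  Returns true if recovery may finish.
 */
bool
sketch_recover_patched(SketchRecoveryState *st, SketchLsn checkpoint_end,
                       const SketchLsn *rec_end, size_t nrecs)
{
    /* Covers the case where no records follow the checkpoint. */
    sketch_check_consistency(st, checkpoint_end);

    for (size_t i = 0; i < nrecs; i++)
    {
        /* ... replay record i here ... */
        sketch_check_consistency(st, rec_end[i]);
    }

    return st->backup_end_point == 0;
}
```

Because the check runs once before the loop, the zero-record case clears
the backup end point as soon as the checkpoint record has been read, and
there is only one copy of the check to maintain.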

Attached is a patch for that (for 9.2), as well as a script I used to
reproduce the bug. The script is a bit messy, and requires tweaking the
paths at the top. Anyone spot a problem with this?

- Heikki

Attachment                        Content-Type      Size
fix-end-of-standby-backup-1.patch text/x-diff       2.4 KB
fix-end-of-standby-backup-1.sh    application/x-sh  2.5 KB
