From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Recovery bug
Date: 2010-10-15 22:58:03
Message-ID: 1287183483.30681.54.camel@jdavis-ux.asterdata.local


This occurs on the master branch, but also pretty much everywhere else.

To reproduce:

First, it requires a little setup to make sure things happen in the
right (wrong?) order. In postgresql.conf I set:
archive_mode = on
archive_command = 'echo -n archiving: %f... && while [ ! -f /tmp/a ]; do sleep 1; done && rm /tmp/a && echo done'

The purpose of the archive command is to control exactly which files get
archived and when (they aren't actually archived anywhere, but that
doesn't matter in this case because I'll never set restore_command).
Also, set checkpoint_completion_target high enough that the checkpoint
taken by pg_start_backup() takes a noticeable amount of time to complete
(0.5 worked for me).
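
To be explicit, the checkpoint-related settings I'm relying on look
roughly like this (these particular values are just what worked for me;
checkpoint_segments = 3 is what the steps below assume):

checkpoint_segments = 3
checkpoint_completion_target = 0.5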

Session1:
create table a( i int );
insert into a select generate_series(1,1000000);
select pg_start_backup('b1'); -- this command takes a while due
-- to checkpoint_completion_target

Session2 (quickly before the start_backup() returns):
insert into a select generate_series(1,1000000);

Now, the last insert should have forced a checkpoint (assuming
checkpoint_segments=3), which enables pg_start_backup() to complete.

Cat backup_label to make sure that the checkpoint for the backup spans
two WAL segments, i.e. the checkpoint record and its redo location fall
in different segments (for me it's 00...5 and 00...6). If not, try
again, or somehow force the checkpoint taken by pg_start_backup() to be
spread out over more concurrent activity.
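
Concretely, I'm checking something like this (assuming $PGDATA points at
the data directory; the locations shown are from my run, and the point
is that they fall in different segment files):

$ cat $PGDATA/backup_label
START WAL LOCATION: 0/5000020 (file 000000010000000000000005)
CHECKPOINT LOCATION: 0/6D17110
...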

Now, do "touch /tmp/a" to step through the archiving process one file at
a time until 00...5 is archived but 00...6 is _not_ archived. Then issue
a manual CHECKPOINT. This should result in 00...6 and beyond existing in
pg_xlog but nothing before it.
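
Roughly, the commands for this step (again assuming $PGDATA; watch the
server log to see which segment each touch releases):

# repeat, one touch per segment, until 00...5 has been archived,
# but stop before 00...6 gets archived:
touch /tmp/a
# then let the checkpoint recycle/remove everything before 00...6:
psql -c 'CHECKPOINT'
# only 00...6 and later should be left:
ls $PGDATA/pg_xlog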

Send a SIGQUIT to the postmaster to simulate a crash. When you bring it
back up, it thinks it is recovering from a backup, so it reads
backup_label. The checkpoint for the backup label is in 00...6, so it
reads that just fine. But then it tries to read WAL starting at the
redo location from that checkpoint, which is in 00...5; that segment no
longer exists, so it PANICs.
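
In command form, that last step is roughly (paths assumed as before):

kill -QUIT $(head -1 $PGDATA/postmaster.pid)  # simulate a crash
pg_ctl -D $PGDATA start                       # recovery then PANICs as described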

Ordinarily you might say that this is just confusion over whether it's
recovering from a backup or not, and you just need to remove
backup_label and try again. But that doesn't work: at this point
StartupXLOG has already done two things:
1. renamed backup_label to backup_label.old
2. updated the control file

So the state has been permanently changed such that it thinks the redo
location is before the earliest WAL you actually have. I see a control
file that looks something like:
Latest checkpoint location: 0/6D17110
Prior checkpoint location: 0/8D4DEF0
Latest checkpoint's REDO location: 0/5000020

(and no more backup_label). That makes no sense, of course. If I try to
restart the postmaster again, it updates the control file again to
something even more surprising:
Latest checkpoint location: 0/6D17110
Prior checkpoint location: 0/6D17110
Latest checkpoint's REDO location: 0/5000020

and PANICs again. Starting the postmaster a third time causes another
PANIC, of course.
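
(For reference, the checkpoint/redo values above are the ones reported
by pg_controldata; something like

pg_controldata $PGDATA | grep location

shows them.)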

There might be an argument that if you crash in the middle of a backup,
you should examine the situation before restarting. But there are two
big problems with that:
1. a backend crash will cause an automatic restart, corrupting
pg_control automatically as well
2. once the restart does happen, it makes permanent changes that can't be
undone (by mere mortals, anyway), removes important information about
what actually happened, and generally leaves you with little to work
with.

I don't have a fix yet, because I think it requires a little discussion.
For instance, it seems to be dangerous to assume that we're starting up
from a backup with access to the archive when it might have been a crash
of the primary system. That assumption is obviously wrong in the case
of an automatic restart, or a restart with no restore_command
configured. Fixing this issue
might also remove the annoying "If you are not restoring from a backup,
try removing..." PANIC error message.

Also, in general we should do more logging during recovery, at least in
the first stages, indicating which WAL segment it's looking for to get
started, why it thinks it needs that segment (backup_label vs. control
data), etc. Ideally we would verify that the necessary files exist (at
least the initial ones) before making any permanent changes. It was
pretty painful trying to work backwards on this problem from the final
control data (where the latest and prior checkpoint locations are the
same, and redo is before both), a crash, a PANIC, a backup_label.old,
and not much else.

Regards,
Jeff Davis
