Re: Recovery bug

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Recovery bug
Date: 2010-10-17 22:48:48
Message-ID: 1287355728.8516.383.camel@jdavis
Lists: pgsql-bugs

On Fri, 2010-10-15 at 15:58 -0700, Jeff Davis wrote:
> I don't have a fix yet, because I think it requires a little discussion.
> For instance, it seems to be dangerous to assume that we're starting up
> from a backup with access to the archive when it might have been a crash
> of the primary system. This is obviously wrong in the case of an
> automatic restart, or one with no restore_command. Fixing this issue
> might also remove the annoying "If you are not restoring from a backup,
> try removing..." PANIC error message.
>
> Also, in general we should do more logging during recovery, at least the
> first stages, indicating what WAL segments it's looking for to get
> started, why it thinks it needs that segment (from backup or control
> data), etc. Ideally we would verify that the necessary files exist (at
> least the initial ones) before making permanent changes. It was pretty
> painful trying to work backwards on this problem from the final
> controldata (where checkpoint and prior checkpoint are the same, and
> redo is before both), a crash, a PANIC, a backup_label.old, and not much
> else.
>

Here's a proposed fix. I didn't solve the problem of determining whether
we really are restoring a backup, or if there's just a backup_label file
left around.

I did two things:
1. If reading a checkpoint from the backup_label location, verify that
   the REDO location for that checkpoint exists in addition to the
   checkpoint itself. If not, elog with a FATAL immediately.
2. Change the error that happens when the checkpoint location
   referenced in the backup_label doesn't exist from a PANIC to a
   FATAL. If it can happen due to a normal crash, a FATAL seems more
   appropriate than a PANIC.
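
As a rough illustration of those two checks, here is a toy Python model. It
is not the C patch in the attachment: FatalRecoveryError, Checkpoint and
checkpoint_from_backup_label are names made up for this sketch, and the
readable WAL is modeled as a plain dict keyed by location.

```python
from dataclasses import dataclass
from typing import Dict, Union


class FatalRecoveryError(Exception):
    """Hypothetical stand-in for elog(FATAL): stop before changing anything."""


@dataclass(frozen=True)
class Checkpoint:
    redo: int  # REDO location this checkpoint points back to


# Toy model of the readable WAL: location -> record found there.
Wal = Dict[int, Union[Checkpoint, str]]


def checkpoint_from_backup_label(label_loc: int, wal: Wal) -> Checkpoint:
    """Return the checkpoint named by backup_label, or fail up front."""
    record = wal.get(label_loc)
    if not isinstance(record, Checkpoint):
        # Item 2: a missing checkpoint record is reported as FATAL, not PANIC.
        raise FatalRecoveryError(
            f"checkpoint record at {label_loc} (from backup_label) not found"
        )
    if record.redo not in wal:
        # Item 1: the REDO location must be readable as well.
        raise FatalRecoveryError(
            f"REDO location {record.redo} for checkpoint at {label_loc} not found"
        )
    return record
```

In this model, a WAL of {100: "record", 200: Checkpoint(redo=100)} accepts a
backup_label pointing at 200, while a checkpoint whose redo location is
absent from the dict is rejected before anything else happens.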

The benefit of this patch is that recovery no longer continues on,
corrupting the control file (pg_control) along the way. It also tells
the administrator exactly what's going on and how to correct it, rather
than leaving them with a PANIC and bogus controldata after they crashed
in the middle of a backup.

I still think it would be nice if postgres knew whether it was restoring
a backup or recovering from a crash; otherwise it's hard to
automatically recover from failures. I thought about using the presence
of recoveryRestoreCommand or PrimaryConnInfo to determine that. But that
seemed potentially dangerous: if the person restoring a backup simply
forgot to set those, recovery would start from the controldata instead
(which is unsafe to do when restoring a backup).
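
To make that concern concrete, here is a small Python sketch of the
heuristic I considered and did not implement. The function name
guess_start_point and its string return values are hypothetical, chosen
only for this illustration.

```python
from typing import Optional


def guess_start_point(
    backup_label_present: bool,
    restore_command: Optional[str],
    primary_conninfo: Optional[str],
) -> str:
    """Guess where recovery should start, using only the settings present."""
    if not backup_label_present:
        # No backup_label at all: ordinary crash recovery.
        return "controldata"
    if restore_command or primary_conninfo:
        # Some WAL source is configured, so assume a backup is being restored.
        return "backup_label"
    # backup_label exists but no WAL source is configured. The heuristic
    # treats this as a crash during a backup, but it is also exactly what a
    # restore with forgotten settings looks like.
    return "controldata"
```

With this rule, guess_start_point(True, None, None) returns "controldata"
both for a primary that crashed mid-backup and for a restored backup whose
settings were forgotten, and the second case is the unsafe one.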

Comments?

Regards,
Jeff Davis

Attachment: recovery.patch.gz (application/x-gzip, 631 bytes)
