Re: Recovery bug

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Recovery bug
Date: 2010-10-19 16:51:35
Message-ID: 1287507095.8516.547.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, 2010-10-19 at 12:26 +0300, Heikki Linnakangas wrote:
> > 1. If reading a checkpoint from the backup_label location, verify that
> > the REDO location for that checkpoint exists in addition to the
> > checkpoint itself. If not, elog with a FATAL immediately.
>
> Makes sense. I wonder if we could just move the rename() after reading
> the checkpoint record?

I assume you mean "after reading the record at the REDO location"? And
you'd need to make sure that any changes to the control data were after
reading the record at the REDO location, as well.

I, too, thought about juggling the order around, but there are quite a
few global variables so it seemed fairly fragile. I'm open to
suggestion, however.

> > 2. Change the error that happens when the checkpoint location
> > referenced in the backup_label doesn't exist to a FATAL. If it can
> > happen due to a normal crash, a FATAL seems more appropriate than a
> > PANIC.
>
> I guess, although it's really not appropriate that the database doesn't
> recover after a crash during a base backup.

Agreed. I'm a little concerned that any fix for that might be intrusive
enough that we didn't want to back-patch it though.

> One alternative is to not remove any WAL files during a base backup. The
> obvious downside is that if the backup takes a long time, you run out of
> disk space.

That doesn't seem very appealing.

> The fundamental problem is that by definition, a base backup is
> completely indistinguishable from the data directory in the original
> server. Or is it? We recommend that you exclude the files under pg_xlog
> from the backup. So we could create a "pg_xlog/just_kidding" file along
> with backup_label. When starting recovery, if just_kidding exists, we
> can assume that we're doing crash recovery and ignore backup_label.

I had some similar ideas, but I rejected them because they added room
for error into the process. One of the problems is that assuming that
we're doing normal recovery when we're doing backup recovery is more
dangerous than the other way around (setting aside this particular bug I
found, anyway). That's because the control data might start the WAL
replay too _late_, which is worse than starting it too early.

> Excluding pg_xlog is just a recommendation at the moment, though, so we
> would need a big warning in the docs. And some way to enforce that
> just_kidding is not included in the backup would be nice, maybe we could
> remove read-permission from it?

Hmm, removing the read bit would add some confidence into the process. I
like that idea better than just assuming that the user won't copy it.

I think I like this direction most, because it doesn't leave us
guessing. If the file is there then we assume normal recovery. If we
encounter recovery.conf we throw FATAL. If we encounter backup_label we
can simply remove it (perhaps with a warning that there was a crash in
the middle of a backup).

However, I do still have some lingering doubts. One is that the file
probably needs to have some content, just in case a backup tool blindly
creates the directory entry and then silently ignores the fact that it
couldn't copy the contents. Another concern is: what if they are using
some kind of filesystem mirroring tool that doesn't take consistent
snapshots (do such tools exist?)? Are there other system-level tools
that people might be using that are safe now, but would be dangerous if
they somehow got a copy of the file?

Regards,
Jeff Davis

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2010-10-19 19:15:06 Re: BUG #5716: Regression joining tables in UPDATE with composite types
Previous Message Richard Huxton 2010-10-19 13:35:01 BUG #5717: Domain as array of numeric/varchar does not respect limits