backup_label during crash recovery: do we know how to solve it?

From: Daniel Farina <daniel(at)heroku(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: backup_label during crash recovery: do we know how to solve it?
Date: 2011-11-30 02:10:48
Message-ID: CAAZKuFbEudBBCc7eV_9KukcfNNBeVnL8eJtEXinxUunEhRR2_w@mail.gmail.com
Lists: pgsql-hackers

Reviving a thread that has hit its second birthday:

http://archives.postgresql.org/pgsql-hackers/2009-11/msg00024.php

In our case, not being able to restart Postgres when it has been taken
down in the middle of a base backup is becoming a serious source of
downtime: any backend crash or machine restart during a backup leaves
postgres unable to start without human intervention. The message
delivered is scary and indirect (because of the plausible scenarios
that could cause corruption if postgres were to make a decision
automatically in the most general case), so it's not all that
attractive to train a general operator rotation to assess what to do:
it involves reading and then, effectively, ignoring some awfully scary
error messages and removing the backup label file. Even if the error
messages weren't scary (itself a problem if one comes to the wrong
conclusion as a result), the time spent digging around on short notice
to confirm what's going on is a long pole in the tent for improving
our uptime, costing an extra five to ten minutes per encounter.
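
To make the operator's first check concrete, here is a minimal sketch that only detects and reads the backup label file; it deliberately removes nothing, since deciding whether removal is safe is the open question here. The function name read_backup_label is hypothetical, and the "KEY: value" line layout it parses is an assumption about the file's contents.

```python
from pathlib import Path
from typing import Dict, Optional


def read_backup_label(data_dir: str) -> Optional[Dict[str, str]]:
    """Return the fields of data_dir/backup_label, or None if it is absent.

    Assumption: the file consists of "KEY: value" lines. Lines that do
    not match that shape are skipped. Nothing is modified or deleted.
    """
    label_path = Path(data_dir) / "backup_label"
    if not label_path.is_file():
        return None

    fields: Dict[str, str] = {}
    for line in label_path.read_text().splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # only keep lines that actually contain "KEY: value"
            fields[key] = value
    return fields
```

A None result means no label file is present in that directory; a dict means one is, and its fields can be shown to whoever is on call before they decide anything.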

Our problem is compounded by having a lot of databases that take base
backups at attenuated rates in an unattended way. As a result, the
human who has to figure out what was going on may have just been woken
from a sound sleep, rather than being a person who already knows a
backup was started. Also, fairly unremarkable databases can take so
long to back up that they may well have a greater than 20% chance of
encountering this problem at any particular time: 20% of a day is
less than 5 hours per day spent doing on-line backups. Basically,
we -- and anyone else with unattended physical backup schemes -- are
punished rather severely by the current design.
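
The 20% figure is simple arithmetic, sketched below under one loud assumption: a crash or restart is equally likely at any moment of the day and is independent of the backup schedule. The function names are hypothetical and exist only for this illustration.

```python
HOURS_PER_DAY = 24.0


def backup_window_fraction(backup_hours_per_day: float) -> float:
    """Fraction of the day during which a base backup is in progress.

    Assumption: crashes are uniformly distributed over the day, so this
    fraction is also the chance that a given crash lands mid-backup.
    """
    if not 0.0 <= backup_hours_per_day <= HOURS_PER_DAY:
        raise ValueError("backup_hours_per_day must be between 0 and 24")
    return backup_hours_per_day / HOURS_PER_DAY


def hours_for_fraction(fraction: float) -> float:
    """Daily backup duration, in hours, that yields the given fraction."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return fraction * HOURS_PER_DAY


if __name__ == "__main__":
    # 20% of a day, expressed in hours of backup time.
    print(f"{hours_for_fraction(0.20):.1f} hours")
    # A backup that runs five hours a day, expressed as a fraction.
    print(f"{backup_window_fraction(5.0):.1%}")
```

So a database whose backup runs 4.8 hours a day already sits at the 20% mark, and anything slower is above it.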

This issue has some more recent related incarnations, even if for
different reasons:

http://archives.postgresql.org/pgsql-hackers/2011-01/msg00764.php

Because backup_label "coming or going?" confusion in Postgres can have
serious consequences, I wanted to post to the list first to solicit a
minimal design to solve this problem. If it's fairly small in its
mechanics then it may yet be feasible for the January CF.

--
fdr
