Archive recovery won't be completed on some situation.

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Archive recovery won't be completed on some situation.
Date: 2014-03-14 10:32:20
Message-ID: 20140314.193220.123692229.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello, we found that PostgreSQL can get stuck and never complete
archive recovery in a certain situation. This occurs on HEAD, 9.3.3,
9.2.7 and 9.1.12.

If the server is killed with SIGKILL after pg_start_backup and some
WAL writes but before pg_stop_backup, restarting it with archive
recovery fails as follows:

| FATAL: WAL ends before end of online backup
| HINT: Online backup started with pg_start_backup() must be
| ended with pg_stop_backup(), and all WAL up to that point must
| be available at recovery.

The trouble is that once the server is in this state, I could find
no proper procedure to get out of it.

In this situation, 'Backup start location' in controldata holds a
valid location, but the corresponding 'end of backup' WAL record
will never arrive.
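
To make the failure condition concrete, here is a toy model in Python.
It is not PostgreSQL's actual code: the names ControlData, replay and
RecoveryError and the record strings are made up for illustration. It
assumes only what is described above, namely that recovery remembers a
backup start location and refuses to finish if WAL runs out before an
'end of backup' record is seen.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


class RecoveryError(Exception):
    """Stands in for the FATAL error that aborts recovery."""


@dataclass
class ControlData:
    # Hypothetical stand-in for 'Backup start location' in controldata.
    # None means no backup is recorded as in progress.
    backup_start_location: Optional[str] = None


def replay(control: ControlData, wal_records: Iterable[str]) -> str:
    """Replay toy WAL records, then apply the end-of-backup check."""
    for record in wal_records:
        if record == "END_OF_BACKUP":
            # The backup ended properly, so the start location is cleared.
            control.backup_start_location = None
    # WAL is exhausted. A start location that is still set means the
    # 'end of backup' record never arrived.
    if control.backup_start_location is not None:
        raise RecoveryError("WAL ends before end of online backup")
    return "recovery complete"
```

In this model, a start location that is set together with a WAL stream
lacking the 'end of backup' record makes replay raise on every call,
which mirrors the server failing again on each restart.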

But I think PG cannot clearly distinguish whether the 'end of
backup' record does not exist at all or will arrive later,
especially when the server starts as a streaming replication
hot standby.

One solution would be a new parameter in recovery.conf that tells
the server to start as if there had never been a backup label when
this situation occurs. It looks ugly and somewhat dangerous, but
seems necessary.
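
The intended effect of such a parameter can be sketched with a small
Python model. This is only an illustration under assumptions, not the
attached patch: finish_recovery and the record strings are
hypothetical, and the flag is named after the
cancel_backup_label_on_failure setting used in the test script. The
operator has to opt in explicitly; otherwise the current behaviour is
kept.

```python
from typing import Iterable, List, Optional, Tuple


def finish_recovery(
    backup_start_location: Optional[str],
    wal_records: Iterable[str],
    cancel_backup_label_on_failure: bool = False,
) -> Tuple[str, List[str]]:
    """Return (final state, log lines) for a toy recovery run."""
    log: List[str] = []
    for record in wal_records:
        if record == "END_OF_BACKUP":
            # Backup ended properly; nothing is left to cancel.
            backup_start_location = None
    if backup_start_location is not None:
        if not cancel_backup_label_on_failure:
            # Default: refuse to finish, as the server does today.
            return "failed", ["FATAL: WAL ends before end of online backup"]
        # Operator opted in: forget the backup start location and go on.
        log.append("WARNING: backup_label was canceled.")
    log.append("LOG: consistent recovery state reached")
    return "recovered", log
```

Keeping the default at False means nothing changes unless the operator
asks for it, which matters because a standby that is simply waiting
for more WAL looks the same as a server that crashed during backup.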

The first attached file is a script to reproduce the problem, and
the second is a patch that tries to do what is described above.

After applying this patch on HEAD and uncommenting
'cancel_backup_label_on_failure = true' in test.sh, the test
script runs as follows:

| LOG: record with zero length at 0/2010F40
| WARNING: backup_label was canceled.
| HINT: server might have crashed during backup mode.
| LOG: consistent recovery state reached at 0/2010F40
| LOG: redo done at 0/2010DA0

What do you think about this?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
unknown_filename text/plain 517 bytes
recoverying_not_finished_backup.patch text/x-patch 1.8 KB
