Re: recovery starting when backup_label exists, but not recovery.signal

From: David Steele <david(at)pgmasters(dot)net>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: recovery starting when backup_label exists, but not recovery.signal
Date: 2019-09-26 18:36:33
Message-ID: 49b8f446-06ed-8e91-5dd6-fa4dfee1ee83@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 9/24/19 1:25 AM, Fujii Masao wrote:
>
> When backup_label exists, the startup process enters archive recovery mode
> even if recovery.signal file doesn't exist. In this case, the startup process
> tries to retrieve WAL files by using restore_command. Then, at the beginning
> of the archive recovery, the contents of backup_label are copied to pg_control
> and backup_label file is removed. This would be an intentional behavior.

> But I think the problem is that, if the server shuts down during that
> archive recovery, the restart of the server may cause the recovery to fail
> because neither backup_label nor recovery.signal exist and the server
> doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
>
> So the problematic scenario is;
>
> 1. the server starts with backup_label, but not recovery.signal.
> 2. the startup process enters an archive recovery mode because
> backup_label exists.
> 3. the contents of backup_label are copied to pg_control and
> backup_label is deleted.

Do you mean deleted or renamed to backup_label.old?

> 4. the server shuts down..

This happens after the cluster has reached consistency?

> 5. the server is restarted. neither backup_label nor recovery.signal exist.
> 6. the startup process starts just crash recovery because neither backup_label
> nor recovery.signal exist. Since it cannot retrieve WAL files from archival
> area, it may fail.

I tried a few ways to reproduce this but was not successful without
manually removing WAL. Probably I just needed a much larger set of WAL.

I assume you have a repro? Can you give more details?

> One idea to fix this issue is to make the above step #3 remember that
> backup_label existed, in pg_control. Then we should make the subsequent
> recovery enter an archive recovery mode if pg_control indicates that
> even if neither backup_label nor recovery.signal exist. Thought?

That seems pretty invasive to me at this stage. I'd like to reproduce
it and see if there are alternatives.

Also, are you sure this is a new behavior? I've been finding that some
behaviors that have existed for a long time are suddenly more apparent
or easier to hit with the new mechanism. Examples of that are in [1].

--
-David
david(at)pgmasters(dot)net

[1]
https://www.postgresql.org/message-id/5e6537c7-d10e-6a67-4813-bbd7455cfaf5%40pgmasters.net

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2019-09-26 18:36:45 Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Previous Message Juan José Santamaría Flecha 2019-09-26 18:36:25 Re: Allow to_date() and to_timestamp() to accept localized names