Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Bowen Shi <zxwsbg12138(at)gmail(dot)com>
Cc: David Zhang <david(dot)zhang(at)highgo(dot)ca>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
Date: 2023-09-27 07:25:41
Message-ID: ZRPY9VwlNYygle4o@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 21, 2023 at 11:45:06AM +0800, Bowen Shi wrote:
> First I encountered the problem " FATAL: could not find
> recovery.signal or standby.signal when recovering with backup_label ",
> then I deleted the backup_label file and started the instance
> successfully.

Doing that is equal to corrupting your instance as recovery would
begin from the latest redo LSN stored in the control file, but the
physical relation files included in the backup may include blocks that
require records that are needed before this redo LSN and the LSN
stored in the backup_label.

>> Delete a backup_label from a fresh base backup can easily lead to data
>> corruption, as the startup process would pick up as LSN to start
>> recovery from the control file rather than the backup_label file.
>> This would happen if a checkpoint updates the redo LSN in the control
>> file while a backup happens and the control file is copied after the
>> checkpoint, for instance. If one wishes to deploy a new primary from
>> a base backup, recovery.signal is the way to go, making sure that the
>> new primary is bumped into a new timeline once recovery finishes, on
>> top of making sure that the startup process starts recovery from a
>> position where the cluster would be able to achieve a consistent
>> state.

And that's what I mean here. In more details. So, really, don't do
that.

> ereport(FATAL,
> (errmsg("could not find redo location referenced by checkpoint record"),
> errhint("If you are restoring from a backup, touch
> \"%s/recovery.signal\" and add required recovery options.\n"
> "If you are not restoring from a backup, try removing the file
> \"%s/backup_label\".\n"
> "Be careful: removing \"%s/backup_label\" will result in a corrupt
> cluster if restoring from a backup.",
> DataDir, DataDir, DataDir)));
>
> There are two similar error messages in xlogrecovery.c. Maybe we can
> modify the error messages to be similar.

The patch adds the following message, which is written this way to be
consistent with the two others, already:

+ ereport(FATAL,
+ (errmsg("could not find recovery.signal or standby.signal when recovering with backup_label"),
+ errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.",
+ DataDir, DataDir)));

But you have an interesting point here, why isn't standby.signal also
mentioned in the two existing comments? Depending on what's wanted by
the user this can be equally useful to report back.

Attached is a slightly updated patch, where I have also removed the
check on ArchiveRecoveryRequested because the FATAL generated for
!ArchiveRecoveryRequested makes sure that it is useless after reading
the backup_label file.

This patch has been around for a few months now. Do others have
opinions about the direction taken here to make the presence of
recovery.signal or standby.signal a hard requirement when a
backup_label file is found (HEAD only)?
--
Michael

Attachment Content-Type Size
v3-0001-Strengthen-use-of-ArchiveRecoveryRequested-and-In.patch text/x-diff 9.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Yuya Watari 2023-09-27 07:28:46 Re: [PoC] Reducing planning time when tables have many partitions
Previous Message David Rowley 2023-09-27 07:01:06 Re: Use virtual tuple slot for Unique node