Re: Use of backup_label not noted in log

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>
Subject: Re: Use of backup_label not noted in log
Date: 2023-11-20 08:30:31
Message-ID: ZVsZJzDGBzUyMCrH@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Nov 18, 2023 at 01:49:15PM -0800, Andres Freund wrote:
> Note that the LSN in the "continuing" case is the one the backup started at,
> not where recovery will start.
>
> I've wondered whether it's worth also adding an explicit message just after
> ReachedEndOfBackup(), but it seems far less urgent due to the existing
> "consistent recovery state reached at %X/%X" message.

Upgrading the surrounding DEBUG1 to a LOG is another option, but I
agree that I've seen less that as being an actual problem in the field
compared to the famous I-removed-a-backup-label-and-I-m-still-up,
until this user sees signs of corruption after recovery was finished,
sometimes days after putting back an instance online.

> Playing around with this a bit, I'm wondering if we instead should remove that
> message, and emit something more informative earlier on. If there's a problem,
> you kinda want the information before we finally get to the loop in
> PerformWalLRecovery(). If e.g. there's no WAL you'll only get
> LOG: invalid checkpoint record
> PANIC: could not locate a valid checkpoint record

I was looking at this code a few weeks ago and have on my stack of
list to do an item about sending a patch to make this exact message
PANIC more talkative as there are a lot of instances with
log_min_messages > log.

> which is delightfully lacking in details.

With a user panicking as much as the server itself, that's even more
tasty.

+ if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
+ ereport(LOG,
+ (errmsg("continuing to start from base backup with redo LSN %X/%X",
+ LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));

"Continuing to start" sounds a bit weird to me, though, considering
that there are a few LOGs that say "starting" when there is a signal
file, but I don't have a better idea on top of my mind. So that
sounds OK here.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2023-11-20 08:51:47 Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Previous Message Michael Paquier 2023-11-20 08:07:38 Re: Simplify if/else logic of walsender CreateReplicationSlot