Re: Use of backup_label not noted in log

From: David Steele <david(at)pgmasters(dot)net>
To: Andres Freund <andres(at)anarazel(dot)de>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: Re: Use of backup_label not noted in log
Date: 2023-11-20 13:56:47
Message-ID: ad242de2-9258-411b-a561-2384b7f4a436@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

[Resending since I accidentally replied off-list]

On 11/18/23 17:49, Andres Freund wrote:
> On 2023-11-18 10:01:42 -0800, Andres Freund wrote:
>>> What about adding it to the "redo starts at" message, something like
>>>
>>> redo starts at 12/12345678, taken from control file
>>>
>>> or
>>>
>>> redo starts at 12/12345678, taken from backup label
>>
>> I think it'd make sense to log use of backup_label earlier than that - the
>> locations from backup_label might end up not being available in the archive,
>> the primary or locally, and we'll error out with "could not locate a valid
>> checkpoint record".
>>
>> I'd probably just do it within the if (read_backup_label()) block in
>> InitWalRecovery(), *before* the ReadCheckpointRecord().
>
> Not enamored with the phrasing of the log messages, but here's a prototype:
>
> When starting up with backup_label present:
> LOG: starting from base backup with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1

I'd prefer something like:

LOG: starting backup recovery with redo...

> When restarting before reaching the end of the backup, but after backup_label
> has been removed:
> LOG: continuing to start from base backup with redo LSN A/34100028
> LOG: entering standby mode
> LOG: redo starts at A/3954B958

And here:

LOG: restarting backup recovery with redo...

> Note that the LSN in the "continuing" case is the one the backup started at,
> not where recovery will start.
>
> I've wondered whether it's worth also adding an explicit message just after
> ReachedEndOfBackup(), but it seems far less urgent due to the existing
> "consistent recovery state reached at %X/%X" message.

I think the current message is sufficient, but what do you have in mind?

> We are quite inconsistent about how we spell LSNs. Sometimes with LSN
> preceding, sometimes not. Sometimes with (LSN). Etc.

Well, this could be improved in HEAD for sure.

>> I do like the idea of expanding the "redo starts at" message
>> though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
>> ControlFile->backupEndPoint would provide information about when the node
>> might become consistent.

+1

> Playing around with this a bit, I'm wondering if we instead should remove that
> message, and emit something more informative earlier on. If there's a problem,
> you kinda want the information before we finally get to the loop in
> PerformWalLRecovery(). If e.g. there's no WAL you'll only get
> LOG: invalid checkpoint record
> PANIC: could not locate a valid checkpoint record
>
> which is delightfully lacking in details.

I've been thinking about improving this myself. It would probably also
help a lot to hint that restore_command may be missing or not returning
results (but also not erroring). But there are a bunch of ways to get to
this message so we'd need to be careful.

> There also are some other oddities:
>
> If the primary is down when starting up, and we'd need WAL from the primary
> for the first record, the "redo start at" message is delayed until that
> happens, because we emit the message not before we read the first record, but
> after. That's just plain odd.

Agreed. Moving it up would be better.

> And sometimes we'll start referencing the LSN at which we are starting
> recovery before the "redo starts at" message. If e.g. we shut down
> at a restart point, we'll emit
>
> LOG: consistent recovery state reached at ...
> before
> LOG: redo starts at ...

Huh, I haven't seen that one. Definitely confusing.

> But that's all clearly just material for HEAD.

Absolutely. I've been thinking about some of this as well, but want to
see if we can remove the backup label first so we don't have to rework a
bunch of stuff.

Of course, that shouldn't stop you from proceeding. I'm sure anything
you are thinking of here could be adapted.

Regards,
-David

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Steele 2023-11-20 13:59:30 Re: Use of backup_label not noted in log
Previous Message David Steele 2023-11-20 13:50:56 Re: Add recovery to pg_control and remove backup_label