From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Subject: | Detecting some cases of missing backup_label |
Date: | 2023-11-30 20:56:05 |
Message-ID: | 20231130205605.slaaw2ny5sjmukn3@awork3.anarazel.de |
Lists: | pgsql-hackers |
Hi,
I recently mentioned to Robert (and earlier to Heikki) that I think I see a
way to detect an omitted backup_label in a relevant subset of cases (it would
apply to pg_control as well, if we moved to that). Robert encouraged me
to share the idea, even though it does not provide complete protection.
The subset I think we can address is the following:
a) An omitted backup_label would lead to corruption, i.e. without the
backup_label we won't start recovery at the right position. Obviously it'd
be better to also catch a wrong procedure when it'd not cause corruption -
perhaps my idea can be extended to handle that, with a small bit of
overhead.
b) The backup has been taken from a primary. Unfortunately that probably can't
be addressed - but the vast majority of backups are taken from a primary,
so I think it's still a worthwhile protection.
Here's my approach:
1) We add an XLOG_BACKUP_START WAL record when starting a base backup on a
primary, emitted just *after* the checkpoint has completed.
2) When replaying a base backup start record, we create a state file that
includes the corresponding LSN in the filename.
3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
that time. Instead the state file is created during pg_backup_stop().
4) When replaying an XLOG_BACKUP_END record, we verify that the state file
created by XLOG_BACKUP_START is present, and error out if not. Backups
that started before the redo LSN from backup_label are ignored
(this necessitates remembering that LSN, but we've been discussing that anyway).
Because the backup state file on the primary is only created during
pg_backup_stop(), a copy of the data directory taken between pg_backup_start()
and pg_backup_stop() does *not* contain the corresponding "backup state
file". Because of this, an omitted backup_label is detected if recovery does
not start early enough - recovery won't encounter the XLOG_BACKUP_START record
and thus would not create the state file, leading to an error in 4).
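
As a rough illustration of steps 1) to 4) and of the detection argument above,
here is a toy Python model of the replay side. It is not PostgreSQL code: the
record layout (in particular an XLOG_BACKUP_END that carries the LSN of its
XLOG_BACKUP_START), the set standing in for the state files, the LSN values,
and all function and class names are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set

BACKUP_START = "XLOG_BACKUP_START"  # the proposed new record type
BACKUP_END = "XLOG_BACKUP_END"


@dataclass(frozen=True)
class WalRecord:
    lsn: int                         # position of this record in the WAL
    kind: str                        # BACKUP_START, BACKUP_END, or anything else
    start_lsn: Optional[int] = None  # assumed: END names its START record


class MissingBackupLabel(Exception):
    """Raised when a BACKUP_END is replayed without its state file."""


def replay(wal: Iterable[WalRecord], start_lsn: int, state_files: Set[int],
           label_redo_lsn: Optional[int] = None) -> Set[int]:
    """Replay records at or after start_lsn.

    state_files holds the START LSNs for which a state file exists in the
    copied data directory. label_redo_lsn is the redo LSN from backup_label,
    or None if there is no backup_label.
    """
    files = set(state_files)
    for rec in wal:
        if rec.lsn < start_lsn:
            continue  # recovery never sees records before its start point
        if rec.kind == BACKUP_START:
            files.add(rec.lsn)  # step 2: create the state file during replay
        elif rec.kind == BACKUP_END:
            # Step 4: backups that started before the redo LSN from
            # backup_label are ignored.
            if label_redo_lsn is not None and rec.start_lsn < label_redo_lsn:
                continue
            if rec.start_lsn not in files:
                raise MissingBackupLabel(
                    f"no state file for backup started at LSN {rec.start_lsn}")
    return files


# Hypothetical timeline: the backup's checkpoint has redo LSN 100, and a later
# checkpoint during the backup has redo LSN 200, which is where recovery
# would start if only pg_control were consulted.
WAL = [
    WalRecord(120, BACKUP_START),
    WalRecord(200, "OTHER"),
    WalRecord(300, BACKUP_END, start_lsn=120),
]
```

In this model, `replay(WAL, 100, set(), label_redo_lsn=100)` (backup_label
present) passes, because the START record at 120 is replayed first. By
contrast `replay(WAL, 200, set())` (backup_label omitted, data directory copied
mid-backup so no state file exists) raises `MissingBackupLabel` at the END
record. Starting at 100 without a backup_label also passes, which matches
restriction a): only the corrupting case is caught.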
It is not a problem that the primary does not create the state file before
pg_backup_stop() - if the primary crashes before pg_backup_stop(), there is no
XLOG_BACKUP_END and thus no error will be raised. It's a bit odd that the
sequence differs between normal processing and recovery, but I think that's
nothing a good comment couldn't explain.
I haven't worked out the details, but I think we might be able to extend this to
catch errors even if there is no checkpoint during the base backup, by
emitting the WAL record *before* the RequestCheckpoint(), and creating the
corresponding state file during backup_label processing at the start of
recovery. That'd probably make the logic for when we can remove the backup
state files a bit more complicated, but I think we could deal with that.
Comments? Swear words?
Greetings,
Andres Freund