Re: backup manifests

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Suraj Kharage <suraj(dot)kharage(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Tels <nospam-pg-abuse(at)bloodgate(dot)com>, David Steele <david(at)pgmasters(dot)net>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: backup manifests
Date: 2020-03-27 20:57:46
Message-ID: 20200327205745.GI13712@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
> On Fri, Mar 27, 2020 at 11:26 AM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > > Seems better to (later?) add support for generating manifests for WAL
> > > files, and then have a tool that can verify all the manifests required
> > > to restore a base backup.
> >
> > I'm not trying to expand on the feature set here or move the goalposts
> > way down the road, which seems to be what's being suggested
> > here. To be clear, I don't have any objection to adding a generic tool
> > for validating WAL as you're talking about here, but I also don't think
> > that's required for pg_validatebackup. What I do think we need is a
> > check of the WAL that's fetched when people use pg_basebackup -Xstream
> > or -Xfetch. pg_basebackup itself has that check because it's critical
> > to the backup being successful and valid. Not having that basic
> > validation of a backup really just isn't ok- there's a reason
> > pg_basebackup has that check.
>
> I don't understand how this could be done without significantly
> complicating the architecture. As I said before, -Xstream sends WAL
> over a separate connection that is unrelated to the one running
> BASE_BACKUP, so the base-backup connection doesn't know what to
> include in the manifest. Now you could do something like: once all of
> the WAL files have been fetched, the client checksums all of those and
> sends their names and checksums to the server, which turns around and
> puts them into the manifest, which it then sends back to the client.
> But that is actually quite a bit of additional complexity, and it's
> pretty strange, too, because now you have the client checksumming some
> files and the server checksumming others. I know you mentioned a few
> different ideas before, but I think they all kinda have some problem
> along these lines.

I've made some suggestions before, and I've also chatted with David about
an idea that I'll outline here.

First off- I'm a bit mystified why you are saying that the base backup
connection doesn't know what to include in the manifest regarding WAL.
The base-backup process determines the starting position (and then even
puts it into the backup_label that's sent to the client), and then it
directly returns the ending position at the end of the BASE_BACKUP
command. Given that we know that information, we just need to
get the checksums/hashes for each of the WAL files, if they've been asked
for. How do we know checksums or hashes have been asked for in the
WAL streaming connection? We can have the pg_basebackup process ask for
that when it connects to stream the WAL that's needed.
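
As a rough illustration of the point that the start and end positions are
enough to name the WAL files involved, here is a minimal Python sketch. It
is not code from the patch. It assumes the default 16MB wal_segment_size, a
single timeline, and that an end position sitting exactly on a segment
boundary belongs to the previous segment; the function names are made up
for the example.

```python
# Sketch only: list the WAL segment file names that cover a backup,
# given its start and end LSNs. Assumes one timeline throughout.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size


def parse_lsn(text):
    """Convert an LSN such as '0/2000028' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline, segno, seg_size=WAL_SEGMENT_SIZE):
    """Build the 24-hex-digit name: timeline, log id, segment within log."""
    segs_per_log = 0x100000000 // seg_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_log,
        segno % segs_per_log,
    )


def required_wal_files(start_lsn, end_lsn, timeline=1,
                       seg_size=WAL_SEGMENT_SIZE):
    """Return the segment names from the start LSN through the end LSN."""
    start = parse_lsn(start_lsn)
    end = parse_lsn(end_lsn)
    if end < start:
        raise ValueError("end LSN precedes start LSN")
    first = start // seg_size
    # An end LSN exactly on a boundary means the prior segment is the last.
    last = (end - 1) // seg_size
    return [wal_file_name(timeline, s, seg_size)
            for s in range(first, last + 1)]
```

With that list in hand, the remaining question is only where the
checksums for those files come from.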

Now the only part that's a little grotty is dealing with passing the
checksums/hashes that the WAL stream connection calculates over to the
base backup connection to include in the manifest. Offhand though, it
seems like we could drop a file in archive_status for that, perhaps
"wal_checksums.PID" or such (the PID would be that of the PG backend
that's doing the base backup, which we'd pass to START_REPLICATION). Of
course, the backup process would have to check and make sure that it got
all the needed WAL file checksums, but since it knows the end, that
shouldn't be too bad.
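
To make that last check concrete, a small Python sketch follows. The
"wal_checksums.PID" file has no defined format; the one-entry-per-line
"filename checksum" layout and the function names here are assumptions
made purely for illustration.

```python
# Sketch only: confirm that every WAL file the backup needs has a
# checksum recorded by the WAL streaming side.
# The "filename checksum" line format is hypothetical.


def parse_checksum_lines(text):
    """Parse 'filename checksum' lines into a {filename: checksum} dict."""
    recorded = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, checksum = line.split()
        recorded[name] = checksum
    return recorded


def missing_wal(required, recorded):
    """Return the required WAL file names that have no recorded checksum."""
    return [name for name in required if name not in recorded]
```

If missing_wal() returns anything, the backup side would either wait
for the streaming side to catch up or fail the manifest, since it knows
where the end is.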

> I also kinda disagree with the idea that the WAL should be considered
> an integral part of the backup. I don't know how pgbackrest does
> things, but BART stores each backup in a separate directory without any
> associated WAL, and then keeps all the WAL together in a different
> directory. I imagine that people who are using continuous archiving
> also tend to use -Xnone, or if they do backups by copying the files
> rather than using pg_backrest, they exclude pg_wal. In fact, for
> people with big, important databases, I'd assume that would be the
> normal pattern. You presumably wouldn't want to keep one copy of the
> WAL files taken during the backup with the backup itself, and a
> separate copy in the archive.

I really don't know what to say to this. WAL is absolutely critical to
a backup being valid. pgBackRest doesn't have a way to *just* validate
a backup today, unfortunately, but we're planning to support it in the
future and we will absolutely include in that validation checking all of
the WAL that's part of the backup.

I'm fine with forgoing all of this in the -X none case, as I've said
elsewhere. I think it'd be great for pg_receivewal to have a way to
validate WAL and such, but that's a clearly new feature and it's
independent from validating a backup.

As it relates to how pgBackRest stores WAL, we actually do support both
of the options you mention, because people with big important databases
like to be extra paranoid. WAL can either be stored in just the
archive, or it can be stored in both the archive and in the backup (with
'--archive-copy'). Note that this isn't done by just grabbing whatever
is in pg_wal at the time of the backup, as that wouldn't actually work,
but rather by copying the necessary WAL from the archive at the end of
the backup.

We do also check all WAL that's pulled from the archive by the restore
command, though exactly what WAL is needed isn't something we know ahead
of time (yet, anyway). We are working on WAL parsing code that'll
change that by actually scanning the WAL and storing all restore points,
starting/ending times and transaction IDs, and anything else that can be
used as a restore target, so we can figure out exactly which WAL is
needed to get to a particular restore target.

We actually have someone who implemented an independent tool called
check_pgbackrest which specifically has an "archives" check, for checking
that the WAL is in the archive. We plan to also provide a way to ask
pgbackrest to confirm that there's no missing WAL, and that all of the
WAL is valid.

WAL is critical to a backup that's been taken in an online manner, no
matter where it's stored. A backup isn't valid without the WAL that's
needed to reach consistency.

Thanks,

Stephen
