Re: [PATCH] Verify Checksums during Basebackups

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Verify Checksums during Basebackups
Date: 2018-03-04 17:29:14
Message-ID: 20180304172914.GU2416@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings Magnus, all,

* Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> On Sun, Mar 4, 2018 at 3:49 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> > > I think it would also be a good idea to have this a three-mode setting,
> > > with "no check", "check and warning", "check and error". Where "check and
> > > error" should be the default, but you could turn off that in "save
> > whatever
> > > is left mode". But I think it's better if pg_basebackup simply fails on a
> > > checksum error, because that will make it glaringly obvious that there
> > is a
> > > problem -- which is the main point of checksums in the first place. And
> > > then an option to turn it off completely in cases where performance is
> > the
> > > thing.
> >
> > When we implemented page-level checksum checking in pgBackRest, David
> > and I had a good long discussion about exactly this question of "warn"
> > vs. "error" and came to a different conclusion- you want a backup to
> > always back up as much as it can even in the face of corruption. If the
> > user has set up their backups in such a way that they don't see the
> > warnings being thrown, it's a good bet they won't see failed backups
> > happening either, in which case they might go from having "mostly" good
> > backups to not having any. Note that I *do* think a checksum failure
> > should result in an non-zero exit-code result from pg_basebackup,
> > indicating that there was something which went wrong.
>
> I would argue that the likelihood of them seeing an error vs a warning is
> orders of magnitude higher.
>
> That said, if we want to exit pg_basebacukp with an exit code but still
> complete the backup, that would also be a workable way I guess. But I
> strongly feel that we should make pg_basebackup scream at the user and
> actually exit with an error -- it's the exit with error that will cause
> their cronjobs to fail, and thus somebody notice it.

Yes, we need to have it exit with a non-zero exit code, I definitely
agree with that. Any indication that the backup may not be valid should
do that, imv. I don't believe we should just abort the backup and throw
away whatever effort has gone into getting the data thus far and then
leave an incomplete backup in place- someone might think it's not
incomplete.. I certainly hope you weren't thinking that pg_basebackup
would then go through and remove the backup that it had been running
when it reached the checksum failure- that would be a dangerous and
rarely tested code path, after all.

> > One difference is that with pgBackRest, we manage the backups and a
> > backup with page-level checksums isn't considered "valid", so we won't
> > expire old backups if a new backup has a checksum failure, but I'm not
> > sure that's really enough to change my mind on if pg_basebackup should
> > outright fail on a checksum error or if it should throw big warnings but
> > still try to perform the backup. If the admin sets things up in a way
> > that a warning and error-exit code from pg_basebackup is ignored and
> > they still expire out their old backups, then even having an actual
> > error result wouldn't change that.
>
> There is another important difference as well. In pgBackRest somebody will
> have to explicitly enable checksum verification -- which already there
> means that they are much more likely to actually check the logs from it.

That's actually not correct- we automatically check page-level checksums
when the C library is available (and it's now required as part of 2.0)
and the database has checksums enabled (that's required of both methods,
of course...), so I don't see the difference you're suggesting here.
pgBackRest does have an option to *require* checksum-checking be done,
and one to disable checksumming, but by default it's enabled. See:

https://pgbackrest.org/command.html#command-backup/category-command/option-checksum-page

> > As an admin, the first thing I would want in a checksum failure scenario
> > is a backup of everything, even the blocks which failed (and then a
> > report of which blocks failed...). I'd rather we think about that
> > use-case than the use-case where the admin sets up backups in such a way
> > that they don't see warnings being thrown from the backup.
>
> I agree. But I absolutely don't want my backup to be listed as successful,
> because then I might expire old ones.
>
> So sure, if we go with WARNING + exit with an errorcode, that is perhaps
> the best combination of the two.

Right, that's what we settled on for pgBackRest also and definitely
makes the most sense to me.

> That said, it probably still makes sense to implement all modes. Or at
> least to implement a "don't bother verifying the checksums" mode. This
> mainly controls what the default would be.

Yes, I'm fine with a "don't verify checksums" option, but I believe the
default should be to verify checksums when the database is configured
with them and, on a checksum failure, throw warnings and exit with an
exit-code that's non-zero and means "page-level checksum verification
failed."

Thanks!

Stephen

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2018-03-04 18:07:11 Re: [HACKERS] plpgsql - additional extra checks
Previous Message Anton Dignös 2018-03-04 17:20:27 Re: IndexJoin memory problem using spgist and boxes