Re: [PATCH] Verify Checksums during Basebackups

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Verify Checksums during Basebackups
Date: 2018-03-04 17:19:00
Message-ID: CABUevEy4VbVz4pF9oLPOd77VLxVOF6_fzmYHMmW6sB8fOT2Scg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Mar 4, 2018 at 3:49 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:

> Greetings Magnus, all,
>
> * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> > I think it would also be a good idea to have this a three-mode setting,
> > with "no check", "check and warning", "check and error". Where "check and
> > error" should be the default, but you could turn off that in "save
> whatever
> > is left mode". But I think it's better if pg_basebackup simply fails on a
> > checksum error, because that will make it glaringly obvious that there
> is a
> > problem -- which is the main point of checksums in the first place. And
> > then an option to turn it off completely in cases where performance is
> the
> > thing.
>
> When we implemented page-level checksum checking in pgBackRest, David
> and I had a good long discussion about exactly this question of "warn"
> vs. "error" and came to a different conclusion- you want a backup to
> always back up as much as it can even in the face of corruption. If the
> user has set up their backups in such a way that they don't see the
> warnings being thrown, it's a good bet they won't see failed backups
> happening either, in which case they might go from having "mostly" good
> backups to not having any. Note that I *do* think a checksum failure
> should result in an non-zero exit-code result from pg_basebackup,
> indicating that there was something which went wrong.
>

I would argue that the likelihood of them seeing an error vs a warning is
orders of magnitude higher.

That said, if we want to exit pg_basebacukp with an exit code but still
complete the backup, that would also be a workable way I guess. But I
strongly feel that we should make pg_basebackup scream at the user and
actually exit with an error -- it's the exit with error that will cause
their cronjobs to fail, and thus somebody notice it.

> One difference is that with pgBackRest, we manage the backups and a
> backup with page-level checksums isn't considered "valid", so we won't
> expire old backups if a new backup has a checksum failure, but I'm not
> sure that's really enough to change my mind on if pg_basebackup should
> outright fail on a checksum error or if it should throw big warnings but
> still try to perform the backup. If the admin sets things up in a way
> that a warning and error-exit code from pg_basebackup is ignored and
> they still expire out their old backups, then even having an actual
> error result wouldn't change that.
>

There is another important difference as well. In pgBackRest somebody will
have to explicitly enable checksum verification -- which already there
means that they are much more likely to actually check the logs from it.

As an admin, the first thing I would want in a checksum failure scenario
> is a backup of everything, even the blocks which failed (and then a
> report of which blocks failed...). I'd rather we think about that
> use-case than the use-case where the admin sets up backups in such a way
> that they don't see warnings being thrown from the backup.
>

I agree. But I absolutely don't want my backup to be listed as successful,
because then I might expire old ones.

So sure, if we go with WARNING + exit with an errorcode, that is perhaps
the best combination of the two.

That said, it probably still makes sense to implement all modes. Or at
least to implement a "don't bother verifying the checksums" mode. This
mainly controls what the default would be.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Anton Dignös 2018-03-04 17:20:27 Re: IndexJoin memory problem using spgist and boxes
Previous Message Tom Lane 2018-03-04 16:59:42 Re: [HACKERS] user-defined numeric data types triggering ERROR: unsupported type