Re: [PATCH] Verify Checksums during Basebackups

From: Michael Banck <michael(dot)banck(at)credativ(dot)de>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: [PATCH] Verify Checksums during Basebackups
Date: 2018-03-31 12:54:05
Message-ID: 20180331125404.GA20852@nighthawk.caipicrew.dd-dns.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Fri, Mar 30, 2018 at 07:46:02AM -0400, Stephen Frost wrote:
> * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> > On Fri, Mar 30, 2018 at 5:35 AM, David Steele <david(at)pgmasters(dot)net> wrote:
> >
> > > On 3/24/18 10:32 AM, Michael Banck wrote:
> > > > Am Freitag, den 23.03.2018, 17:43 +0100 schrieb Michael Banck:
> > > >> Am Freitag, den 23.03.2018, 10:54 -0400 schrieb David Steele:
> > > >>> In my experience actual block errors are relatively rare, so there
> > > >>> aren't likely to be more than a few in a file. More common are
> > > >>> overwritten or transposed files, rogue files, etc. These produce a lot
> > > >>> of output.
> > > >>>
> > > >>> Maybe stop after five?
> > > >
> > > > The attached patch does that, and outputs the total number of
> > > > verification failures of that file after it got sent.
> > > >
> > > >> I'm on board with this, but I have the feeling that this is not a very
> > > >> common pattern in Postgres, or might not be project style at all. I
> > > >> can't remember even seen an error message like that.
> > > >>
> > > >> Anybody know whether we're doing this in a similar fashion elsewhere?
> > > >
> > > > I tried to have look around and couldn't find any examples, so I'm not
> > > > sure that patch should go in. On the other hand, we abort on checksum
> > > > failures usually (in pg_dump e.g.), so limiting the number of warnings
> > > > does makes sense.
> > > >
> > > > I guess we need to see what others think.
> > >
> > > Well, at this point I would say silence more or less gives consent.
> > >
> > > Can you provide a rebased patch with the validation retry and warning
> > > limiting logic added? I would like to take another pass through it but I
> > > think this is getting close.
> >
> > I was meaning to mention it, but ran out of cycles.
> >
> > I think this is the right way to do it, except the 5 should be a #define
> > and not an inline hardcoded value :) We could argue whether it should be "5
> > total" or "5 per file". When I read the emails I thought it was going to be
> > 5 total, but I see the implementation does 5 / file. In a super-damanged
> > system that will still lead to horrible amounts of logging, but I think
> > maybe if your system is in that bad shape, then it's a lost cause anyway.
>
> 5/file seems reasonable to me as well.
>
> > I also think the "total number of checksum errors" should be logged if
> > they're >0, not >5. And I think *that* one should be logged at the end of
> > the entire process, not per file. That'd be the kind of output that would
> > be the most interesting, I think (e.g. if I have it spread out with 1 block
> > each across 4 files, I want that logged at the end because it's easy to
> > otherwise miss one or two of them that may have happened a long time apart).
>
> I definitely like having a total # of checksum errors included at the
> end, if there are any at all. When someone is looking to see why the
> process returned a non-zero exit code, they're likely to start looking
> at the end of the log, so having that easily available and clear as to
> why the backup failed is definitely valuable.
>
> > I don't think we have a good comparison elsewhere, and that is as David
> > mention because other codepaths fail hard when they run into something like
> > that. And we explicitly want to *not* fail hard, per previous discussion.
>
> Agreed.

Attached is a new and rebased patch which does the above, plus
integrates the suggested changes by David Steele. The output is now:

$ initdb -k --pgdata=data1 1> /dev/null 2> /dev/null
$ pg_ctl --pgdata=data1 --log=pg1.log start > /dev/null
$ dd conv=notrunc oflag=seek_bytes seek=4000 bs=8 count=1 if=/dev/zero of=data1/base/12374/2610 2> /dev/null
$ for i in 4000 13000 21000 29000 37000 43000; do dd conv=notrunc oflag=seek_bytes seek=$i bs=8 count=1 if=/dev/zero of=data1/base/12374/1259; done 2> /dev/null
$ pg_basebackup -v -h /tmp --pgdata=data2
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/2000060 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_13882"
WARNING: checksum verification failed in file "./base/12374/2610", block 0: calculated C2C9 but expected EC78
WARNING: checksum verification failed in file "./base/12374/1259", block 0: calculated 8BAE but expected 46B8
WARNING: checksum verification failed in file "./base/12374/1259", block 1: calculated E413 but expected 7701
WARNING: checksum verification failed in file "./base/12374/1259", block 2: calculated 5DA9 but expected D5AA
WARNING: checksum verification failed in file "./base/12374/1259", block 3: calculated 5651 but expected 4F5E
WARNING: checksum verification failed in file "./base/12374/1259", block 4: calculated DF39 but expected DF00
WARNING: further checksum verification failures in file "./base/12374/1259" will not be reported
WARNING: file "./base/12374/1259" has a total of 6 checksum verification failures
WARNING: 7 total checksum verification failures
pg_basebackup: write-ahead log end point: 0/2000130
pg_basebackup: checksum error occured
$ echo $?
1
$ du -s data2
38820 data2

I ommitted the 'file "foo" has a total of X checksum verification
failures' if there is only one, as seen with file "./base/12374/2610"
above. Same with the "X total checksum verification failures" if there
was only one.

Is that ok with everybody?

Michael

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Attachment Content-Type Size
basebackup-verify-checksum-V5.patch text/x-diff 20.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2018-03-31 13:04:15 Re: [HACKERS][PATCH] adding simple sock check for windows
Previous Message CharSyam 2018-03-31 12:38:55 Re: [HACKERS][PATCH] adding simple sock check for windows