Re: [PATCH] Verify Checksums during Basebackups

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: [PATCH] Verify Checksums during Basebackups
Date: 2018-04-01 15:27:21
Message-ID: CABUevEzz9NfUtjXtfjCyjMuHGBdFQAd=OfUsohK=epeGfBeqjg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Mar 31, 2018 at 2:54 PM, Michael Banck <michael(dot)banck(at)credativ(dot)de>
wrote:

> Hi,
>
> On Fri, Mar 30, 2018 at 07:46:02AM -0400, Stephen Frost wrote:
> > * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> > > On Fri, Mar 30, 2018 at 5:35 AM, David Steele <david(at)pgmasters(dot)net>
> wrote:
> > >
> > > > On 3/24/18 10:32 AM, Michael Banck wrote:
> > > > > Am Freitag, den 23.03.2018, 17:43 +0100 schrieb Michael Banck:
> > > > >> Am Freitag, den 23.03.2018, 10:54 -0400 schrieb David Steele:
> > > > >>> In my experience actual block errors are relatively rare, so
> there
> > > > >>> aren't likely to be more than a few in a file. More common are
> > > > >>> overwritten or transposed files, rogue files, etc. These
> produce a lot
> > > > >>> of output.
> > > > >>>
> > > > >>> Maybe stop after five?
> > > > >
> > > > > The attached patch does that, and outputs the total number of
> > > > > verification failures of that file after it got sent.
> > > > >
> > > > >> I'm on board with this, but I have the feeling that this is not a
> very
> > > > >> common pattern in Postgres, or might not be project style at
> all. I
> > > > >> can't remember even seen an error message like that.
> > > > >>
> > > > >> Anybody know whether we're doing this in a similar fashion
> elsewhere?
> > > > >
> > > > > I tried to have look around and couldn't find any examples, so I'm
> not
> > > > > sure that patch should go in. On the other hand, we abort on
> checksum
> > > > > failures usually (in pg_dump e.g.), so limiting the number of
> warnings
> > > > > does makes sense.
> > > > >
> > > > > I guess we need to see what others think.
> > > >
> > > > Well, at this point I would say silence more or less gives consent.
> > > >
> > > > Can you provide a rebased patch with the validation retry and warning
> > > > limiting logic added? I would like to take another pass through it
> but I
> > > > think this is getting close.
> > >
> > > I was meaning to mention it, but ran out of cycles.
> > >
> > > I think this is the right way to do it, except the 5 should be a
> #define
> > > and not an inline hardcoded value :) We could argue whether it should
> be "5
> > > total" or "5 per file". When I read the emails I thought it was going
> to be
> > > 5 total, but I see the implementation does 5 / file. In a
> super-damanged
> > > system that will still lead to horrible amounts of logging, but I think
> > > maybe if your system is in that bad shape, then it's a lost cause
> anyway.
> >
> > 5/file seems reasonable to me as well.
> >
> > > I also think the "total number of checksum errors" should be logged if
> > > they're >0, not >5. And I think *that* one should be logged at the end
> of
> > > the entire process, not per file. That'd be the kind of output that
> would
> > > be the most interesting, I think (e.g. if I have it spread out with 1
> block
> > > each across 4 files, I want that logged at the end because it's easy to
> > > otherwise miss one or two of them that may have happened a long time
> apart).
> >
> > I definitely like having a total # of checksum errors included at the
> > end, if there are any at all. When someone is looking to see why the
> > process returned a non-zero exit code, they're likely to start looking
> > at the end of the log, so having that easily available and clear as to
> > why the backup failed is definitely valuable.
> >
> > > I don't think we have a good comparison elsewhere, and that is as David
> > > mention because other codepaths fail hard when they run into something
> like
> > > that. And we explicitly want to *not* fail hard, per previous
> discussion.
> >
> > Agreed.
>
> Attached is a new and rebased patch which does the above, plus
> integrates the suggested changes by David Steele. The output is now:
>
> $ initdb -k --pgdata=data1 1> /dev/null 2> /dev/null
> $ pg_ctl --pgdata=data1 --log=pg1.log start > /dev/null
> $ dd conv=notrunc oflag=seek_bytes seek=4000 bs=8 count=1 if=/dev/zero
> of=data1/base/12374/2610 2> /dev/null
> $ for i in 4000 13000 21000 29000 37000 43000; do dd conv=notrunc
> oflag=seek_bytes seek=$i bs=8 count=1 if=/dev/zero
> of=data1/base/12374/1259; done 2> /dev/null
> $ pg_basebackup -v -h /tmp --pgdata=data2
> pg_basebackup: initiating base backup, waiting for checkpoint to complete
> pg_basebackup: checkpoint completed
> pg_basebackup: write-ahead log start point: 0/2000060 on timeline 1
> pg_basebackup: starting background WAL receiver
> pg_basebackup: created temporary replication slot "pg_basebackup_13882"
> WARNING: checksum verification failed in file "./base/12374/2610", block
> 0: calculated C2C9 but expected EC78
> WARNING: checksum verification failed in file "./base/12374/1259", block
> 0: calculated 8BAE but expected 46B8
> WARNING: checksum verification failed in file "./base/12374/1259", block
> 1: calculated E413 but expected 7701
> WARNING: checksum verification failed in file "./base/12374/1259", block
> 2: calculated 5DA9 but expected D5AA
> WARNING: checksum verification failed in file "./base/12374/1259", block
> 3: calculated 5651 but expected 4F5E
> WARNING: checksum verification failed in file "./base/12374/1259", block
> 4: calculated DF39 but expected DF00
> WARNING: further checksum verification failures in file
> "./base/12374/1259" will not be reported
> WARNING: file "./base/12374/1259" has a total of 6 checksum verification
> failures
> WARNING: 7 total checksum verification failures
> pg_basebackup: write-ahead log end point: 0/2000130
> pg_basebackup: checksum error occured
> $ echo $?
> 1
> $ du -s data2
> 38820 data2
>
> I ommitted the 'file "foo" has a total of X checksum verification
> failures' if there is only one, as seen with file "./base/12374/2610"
> above. Same with the "X total checksum verification failures" if there
> was only one.
>
> Is that ok with everybody?
>
>
I like most of this patch now :)

It might be a micro-optimisation, but ISTM that we shouldn't do the
basename(palloc(fn)) in is_heap_file() unless we actually need it -- so not
at the top of the function. (And surely "." and ".." should not occur here?
I think that's a result of copy/paste from the online checksum patch?

We also do exactly the same basename(palloc(fn)) in sendFile(). Can we
find a way to reuse that duplication? Perhaps we want to split it into the
two pieces already out in sendFile() and pass it in as separate parameters?

If not then this second one should at least only be done inside the if
(verify_checksums).

There is a bigger problem next to that though -- I believe basename() does
not exist on Win32. I haven't tested it, but there is zero documentation of
it existing, which usually indicates it doesn't. That's the part that
definitely needs to get fixed.

I think you need to look into the functionality in port/path.c, in
particular last_dir_separator()?

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2018-04-01 15:46:38 Re: [PATCH] pg_hba.conf : new auth option : clientcert=verify-full
Previous Message Дмитрий Воронин 2018-04-01 15:11:04 Re: Diagonal storage model