Re: [PATCH] Verify Checksums during Basebackups

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: David Steele <david(at)pgmasters(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: [PATCH] Verify Checksums during Basebackups
Date: 2018-03-30 11:46:02
Lists: pgsql-hackers


* Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> On Fri, Mar 30, 2018 at 5:35 AM, David Steele <david(at)pgmasters(dot)net> wrote:
> > On 3/24/18 10:32 AM, Michael Banck wrote:
> > > Am Freitag, den 23.03.2018, 17:43 +0100 schrieb Michael Banck:
> > >> Am Freitag, den 23.03.2018, 10:54 -0400 schrieb David Steele:
> > >>> In my experience actual block errors are relatively rare, so there
> > >>> aren't likely to be more than a few in a file. More common are
> > >>> overwritten or transposed files, rogue files, etc. These produce a lot
> > >>> of output.
> > >>>
> > >>> Maybe stop after five?
> > >
> > > The attached patch does that, and outputs the total number of
> > > verification failures of that file after it got sent.
> > >
> > >> I'm on board with this, but I have the feeling that this is not a very
> > >> common pattern in Postgres, or might not be project style at all. I
> > >> can't remember ever seeing an error message like that.
> > >>
> > >> Anybody know whether we're doing this in a similar fashion elsewhere?
> > >
> > > I tried to have a look around and couldn't find any examples, so I'm not
> > > sure that patch should go in. On the other hand, we abort on checksum
> > > failures usually (in pg_dump e.g.), so limiting the number of warnings
> > > does make sense.
> > >
> > > I guess we need to see what others think.
> >
> > Well, at this point I would say silence more or less gives consent.
> >
> > Can you provide a rebased patch with the validation retry and warning
> > limiting logic added? I would like to take another pass through it but I
> > think this is getting close.
> I was meaning to mention it, but ran out of cycles.
> I think this is the right way to do it, except the 5 should be a #define
> and not an inline hardcoded value :) We could argue whether it should be "5
> total" or "5 per file". When I read the emails I thought it was going to be
> 5 total, but I see the implementation does 5 / file. In a super-damaged
> system that will still lead to horrible amounts of logging, but I think
> maybe if your system is in that bad shape, then it's a lost cause anyway.

5/file seems reasonable to me as well.

> I also think the "total number of checksum errors" should be logged if
> they're >0, not >5. And I think *that* one should be logged at the end of
> the entire process, not per file. That'd be the kind of output that would
> be the most interesting, I think (e.g. if I have it spread out with 1 block
> each across 4 files, I want that logged at the end because it's easy to
> otherwise miss one or two of them that may have happened a long time apart).

I definitely like having a total # of checksum errors included at the
end, if there are any at all. When someone is looking to see why the
process returned a non-zero exit code, they're likely to start looking
at the end of the log, so having that easily available and clear as to
why the backup failed is definitely valuable.

> I don't think we have a good comparison elsewhere, and that is as David
> mention because other codepaths fail hard when they run into something like
> that. And we explicitly want to *not* fail hard, per previous discussion.
