Re: pg_basebackup misses to report checksum error

From: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_basebackup misses to report checksum error
Date: 2020-05-07 18:18:48
Message-ID: CALfoeis7zxF+qw1LG46QOrooajTZQJJsb-CZedR1usnF06objg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, May 7, 2020 at 10:25 AM Stephen Frost <sfrost(at)snowman(dot)net> wrote:

> Greetings,
>
> * Ashwin Agrawal (aagrawal(at)pivotal(dot)io) wrote:
> > On Wed, May 6, 2020 at 3:02 PM Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> > > On Wed, May 6, 2020 at 5:48 PM Ashwin Agrawal <aagrawal(at)pivotal(dot)io>
> wrote:
> > > > If pg_basebackup is not able to read BLCKSZ content from file, then
> it
> > > > just emits a warning "could not verify checksum in file "____" block
> > > > X: read buffer size X and page size 8192 differ" currently but misses
> > > > to error with "checksum error occurred". Only if it can read 8192 and
> > > > checksum mismatch happens will it error in the end.
> > >
> > > I don't think it's a good idea to conflate "hey, we can't checksum
> > > this because the size is strange" with "hey, the checksum didn't
> > > match". Suppose the a file has 1000 full blocks and a partial block.
> > > All 1000 blocks have good checksums. With your change, ISTM that we'd
> > > first emit a warning saying that the checksum couldn't be verified,
> > > and then we'd emit a second warning saying that there was 1 checksum
> > > verification failure, which would also be reported to the stats
> > > system. I don't think that's what we want.
> >
> > I feel the intent of reporting "total checksum verification failure" is
> to
> > report corruption. Which way is the secondary piece of the puzzle. Not
> > being able to read checksum itself to verify is also corruption and is
> > checksum verification failure I think. WARNINGs will provide fine grained
> > clarity on what type of checksum verification failure it is, so I am not
> > sure we really need fine grained clarity in "total numbers" to
> > differentiate these two types.
>
> Are we absolutely sure that there's no way for a partial block to end up
> being seen by pg_basebackup, which is just doing routine filesystem
> read() calls, during normal operation though..? Across all platforms?
>

Okay, that's a good point, I didn't think about it. This comment to skip
verifying checksum, I suppose convinces, can't be sure and hence can't
report partial blocks as corruption.

/*
* Only check pages which have not been modified since the
* start of the base backup. Otherwise, they might have been
* written only halfway and the checksum would not be valid.
* However, replaying WAL would reinstate the correct page in
* this case. We also skip completely new pages, since they
* don't have a checksum yet.
*/

Might be nice to have a similar comment for the partial block case to
document why we can't report it as corruption. Thanks.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2020-05-07 18:54:12 Re: PG 13 release notes, first draft
Previous Message Chapman Flack 2020-05-07 18:08:58 Re: PG 13 release notes, first draft