Re: Online verification of checksums

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Online verification of checksums
Date: 2020-11-24 01:54:41
Message-ID: CAOuzzgpJ1X7oXZXGy+Rkv52M=ZiRRWUF5gkD99=cCswwt5idRA@mail.gmail.com
Lists: pgsql-hackers

Greetings,

On Mon, Nov 23, 2020 at 20:28 Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:
> > * Anastasia Lubennikova (a(dot)lubennikova(at)postgrespro(dot)ru) wrote:
> >> Yes and this is a tricky part. Until you have explained it in your
> >> latest message, I wasn't sure how we can distinguish a concurrent
> >> update from a page header corruption. Now I agree that if the page
> >> LSN is updated and increased between rereads, it is safe enough to
> >> conclude that we have some concurrent load.
> >
> > Even in this case, it's almost free to compare the LSN to the starting
> > backup LSN, and to the current LSN position, and make sure it's
> > somewhere between the two. While that doesn't entirely eliminate the
> > possibility that the page happened to get corrupted *and* return a
> > different result on subsequent reads *and* that it was corrupted in such
> > a way that the LSN ended up falling between the starting backup LSN and
> > the current LSN, it's certainly reducing the chances of a false negative
> > a fair bit.
>
> FWIW, I am not much of a fan of designs that are not bullet-proof by
> design. This reduces the odds of problems, sure, but it still does not
> discard the possibility of incorrect results, confusing users as well
> as people looking at such reports.

Let’s be clear about this- our checksums are, themselves, far from
bulletproof, regardless of all of our other efforts. They don’t catch every
possible corruption, and they certainly aren’t anywhere close to the
guarantees you’d expect from, say, cryptographic integrity checks. We
cannot say with certainty that a page which passes checksum validation
isn’t corrupted in some way. A page which doesn’t pass checksum validation
may be corrupted or may simply be torn- we can’t be 100% sure which it is
either, but we can work to try and make a sensible call about it.
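
Just to be concrete about what "passes checksum validation" means here- the
per-page check itself boils down to something like the following (only a
sketch, written against PostgreSQL's frontend-usable checksum code the way
pg_checksums does it; verify_block() is my name, not something from any
existing patch):

#include "postgres_fe.h"

#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "storage/checksum_impl.h"

/*
 * Sketch: does the BLCKSZ-sized block at "buf" pass checksum validation?
 * "blkno" must be the block number counted across all segments of the
 * relation, since pg_checksum_page() mixes it into the checksum.
 */
static bool
verify_block(char *buf, BlockNumber blkno)
{
    PageHeader  header = (PageHeader) buf;

    /* All-zero (new) pages carry no checksum and are treated as valid. */
    if (PageIsNew(buf))
        return true;

    return pg_checksum_page(buf, blkno) == header->pd_checksum;
}

That part is the easy bit- everything being debated here is what to
conclude when it returns false while PG is actively writing.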

> >> To sum up, I agree with your proposal to reread the page and rely on
> >> ascending LSNs. Can you submit a patch?
> >
> > Probably would make sense to give Michael an opportunity to comment and
> > get his thoughts on this, and for him to update the patch if he agrees.
>
> I think that a LSN check would be a safe thing to do iff pd_checksum
> is already checked first to make sure that the page contents are fine
> to use. Still, what's the point in doing a LSN check anyway if we
> know that the checksum is valid? Then, on a retry after the first
> attempt has failed, you also need the guarantee that there is zero
> concurrent I/O activity while a page is rechecked (no need to do that
> unless the initial checksum check failed). So the retry needs
> to do some s_b interactions, but then comes the much trickier point of
> concurrent smgrwrite() calls bypassing the shared buffers.

I agree that the LSN check isn’t interesting if the page passes the
checksum validation. I do think we can look at the LSN and make reasonable
inferences based on it even if the checksum doesn’t validate- in
particular, in my experience at least, the result of a read, without any
intervening write, is very likely to be the same if performed multiple
times quickly, even if there is latent storage corruption- due to caching,
if nothing else. What’s interesting about the LSN check is that we are
specifically looking to see if it *changed* in a reasonable and predictable
manner, that it was replaced with a new yet plausible value. The chances of
that happening due to latent storage corruption are vanishingly small.
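
Spelled out, the reread heuristic I’m describing is roughly the following
(again just a sketch- verify_block() is the helper sketched above, and
start_lsn/current_lsn are whatever the calling tool knows about the
backup’s LSN range):

/*
 * Sketch: the first read of this block failed checksum validation and it
 * has been reread.  Decide whether the failure looks like a concurrent
 * write (fine, WAL replay will fix it) or like real corruption.
 */
static bool
block_looks_ok_after_reread(char *first_read, char *reread,
                            BlockNumber blkno,
                            XLogRecPtr start_lsn, XLogRecPtr current_lsn)
{
    XLogRecPtr  old_lsn = PageGetLSN(first_read);
    XLogRecPtr  new_lsn = PageGetLSN(reread);

    /* If the reread copy validates, the first read was simply torn. */
    if (verify_block(reread, blkno))
        return true;

    /*
     * Otherwise accept the block only if its LSN actually changed between
     * the two reads *and* the new value falls inside the LSN range covered
     * by this backup.  Latent storage corruption producing exactly that
     * pattern is what we're arguing is vanishingly unlikely.
     */
    return new_lsn != old_lsn &&
           new_lsn >= start_lsn &&
           new_lsn <= current_lsn;
}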

> > As it relates to pgbackrest, we're currently contemplating having a
> > higher level loop which, upon detecting any page with an invalid
> > checksum, continues to scan to the end of that file and perform the
> > compression, encryption, et al, but then loops back after we've
> > completed that file and skips through the file again, re-reading those
> > pages which didn't have a valid checksum the first time to see if their
> > LSN has changed and is within the range of the backup. This will
> > certainly give more opportunity for the kernel to 'catch up', if needed,
> > and give us an updated page without a random 100ms delay, and will also
> > make it easier for us to, eventually, check and make sure the page was
> > in the WAL that was produced as part of the backup, to give us a
> > complete guarantee that the contents of this page don't matter and that
> > the failed checksum isn't a sign of latent storage corruption.
>
> That would reduce the likelihood of facing torn pages, but you still
> cannot fully discard the problem either, as the same page may get changed
> again once you loop over, no? And what if a corruption has updated
> pd_lsn on-disk? Unlikely so, still possible.

We surely don’t care about a page which has been changed multiple times by
PG during the backup, since all those changes will be, by definition, in
the WAL, no? Therefore, one loop to see that the value of the LSN
*changed*, meaning something wrote something new there, with a cross-check
to see that the new LSN is in the expected range, goes an awfully long way
toward assuring that this isn’t a case of latent storage corruption. If
there is an attacker who is not the PG process but who is modifying files
then, yes, that’s a risk, and it won’t be picked up by this, but why would
they create an invalid checksum in the first place..?

We aren’t attempting to protect against a sophisticated attack, we are
trying to detect latent storage corruption.
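
To sketch the shape of that higher-level loop, building on the helpers
above (hypothetical glue throughout- read_block(), copy_into_backup(),
report_corruption() and the failure list are stand-ins, not actual
pgbackrest code):

/* Hypothetical glue- none of these names exist in pgbackrest. */
typedef struct FailedBlock
{
    BlockNumber blkno;
    char        first_copy[BLCKSZ];
} FailedBlock;

extern void read_block(const char *file, BlockNumber blkno, char *buf);
extern void copy_into_backup(const char *file, BlockNumber blkno,
                             const char *buf);
extern void report_corruption(const char *file, BlockNumber blkno);

static void
backup_file_with_recheck(const char *file, BlockNumber nblocks,
                         XLogRecPtr backup_start_lsn,
                         XLogRecPtr backup_current_lsn)
{
    char        page[BLCKSZ];
    FailedBlock *failures = NULL;
    int         nfailures = 0;
    BlockNumber blkno;
    int         i;

    /* Pass one: stream the whole file into the backup, noting failures. */
    for (blkno = 0; blkno < nblocks; blkno++)
    {
        read_block(file, blkno, page);
        copy_into_backup(file, blkno, page);    /* compress, encrypt, et al */

        if (!verify_block(page, blkno))
        {
            failures = realloc(failures, (nfailures + 1) * sizeof(FailedBlock));
            failures[nfailures].blkno = blkno;
            memcpy(failures[nfailures].first_copy, page, BLCKSZ);
            nfailures++;
        }
    }

    /* Pass two: after the file is complete, revisit only the failures. */
    for (i = 0; i < nfailures; i++)
    {
        /* Fresh read, well after the first pass touched this block. */
        read_block(file, failures[i].blkno, page);

        if (!block_looks_ok_after_reread(failures[i].first_copy, page,
                                         failures[i].blkno,
                                         backup_start_lsn,
                                         backup_current_lsn))
            report_corruption(file, failures[i].blkno);
    }

    free(failures);
}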

I would also ask for clarification as to whether you feel that checking the
WAL for the page would be insufficient somehow, since I mentioned that as
also being on the roadmap. If there’s some reason that checking the WAL for
the page wouldn’t be sufficient, I am anxious to understand that reasoning.
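
To be explicit about what I mean by checking the WAL: scan the WAL
generated between the backup’s start and stop LSNs and see whether any
record references the block in question- if one does, whatever we copied
for that block doesn’t matter, since replay will overwrite it. Roughly
(wal_next_record() and record_references_block() are hypothetical stand-ins
for an xlogreader-based loop, not a real API):

#include "storage/relfilenode.h"

/* Hypothetical WAL-scanning helpers standing in for an xlogreader loop. */
typedef struct WalRecordInfo
{
    XLogRecPtr  lsn;
} WalRecordInfo;

extern WalRecordInfo *wal_next_record(void *scan_state, XLogRecPtr after_lsn);
extern bool record_references_block(WalRecordInfo *rec,
                                    RelFileNode rnode, BlockNumber blkno);

/*
 * Sketch: is the given block referenced by any WAL record generated
 * between the backup's start and stop LSNs?  If so, a failed checksum in
 * the copied data file doesn't matter- replay of that WAL overwrites the
 * block anyway.
 */
static bool
block_covered_by_backup_wal(void *scan_state, RelFileNode rnode,
                            BlockNumber blkno,
                            XLogRecPtr start_lsn, XLogRecPtr stop_lsn)
{
    WalRecordInfo *rec;

    for (rec = wal_next_record(scan_state, start_lsn);
         rec != NULL && rec->lsn <= stop_lsn;
         rec = wal_next_record(scan_state, rec->lsn))
    {
        if (record_references_block(rec, rnode, blkno))
            return true;
    }

    return false;
}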

Thanks,

Stephen
