Re: Online verification of checksums

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, David Steele <david(at)pgmasters(dot)net>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2020-11-27 16:15:27
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


* Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > > We are not just looking at one LSN value. Here are the steps we are
> > > proposing (I'll skip checks for zero pages here):
> > >
> > > 1) Test the page checksum. If it passes the page is OK.
> > > 2) If the checksum does not pass then record the page offset and LSN and
> > > continue.
> >
> > But here the checksum is broken, so while the offset is something we
> > can rely on how do you make sure that the LSN is fine? A broken
> > checksum could perfectly mean that the LSN is actually *not* fine if
> > the page header got corrupted.

Of course that could be the case, but it gets to be a smaller and
smaller chance by checking that the LSN read falls within reasonable

> > > 3) After the file is copied, reopen and reread the file, seeking to offsets
> > > where possible invalid pages were recorded in the first pass.
> > > a) If the page is now valid then it is OK.
> > > b) If the page is not valid but the LSN has increased from the LSN
> >
> > Per se the previous point about the LSN value that we cannot rely on.
> We cannot rely on the LSN itself. But it's a lot more likely that we
> can rely on the LSN changing, and on the LSN changing in a "correct
> way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
> *doesn't change* we know it's corrupt. But if the LSN *increases* AND
> the new page now has a correct checksum, it's very most likely to be
> correct. You could perhaps even put cap on it saying "if the LSN
> increased, but less than <n>", where <n> is a sufficiently high number
> that it's entirely unreasonable to advanced that far between the
> reading of two blocks. But it has to have a very high margin in that
> case.

This is, in fact, included in what was proposed- the "max increase"
would be "the ending LSN of the backup". I don't think we can make it
any tighter than that though without risking false positives, which is
surely worse than a false negative in this particular case- we already
risk false negatives due to the fact that our checksum isn't perfect, so
even a perfect check to make sure that the page will, in fact, be
replayed over during crash recovery doesn't guarantee that there's no

> > > A malicious attacker could easily trick these checks, but as Stephen pointed
> > > out elsewhere they would likely make the checksums valid which would escape
> > > detection anyway.
> > >
> > > We believe that the chances of random storage corruption passing all these
> > > checks is incredibly small, but eventually we'll also check against the WAL
> > > to be completely sure.
> >
> > The lack of check for any concurrent I/O on the follow-up retries is
> > disturbing. How do you guarantee that on the second retry what you
> > have is a torn page and not something corrupted? Init forks for
> > example are made of up to 2 blocks, so the window would get short for
> > at least those. There are many instances with tables that have few
> > pages as well.

If there's an easy and cheap way to see if there was concurrent i/o
happening for the page, then let's hear it. One idea that has occured
to me which hasn't been discussed is checking the file's mtime to see if
it's changed since the backup started. In that case, I would think it'd
be something like:

- Checksum is invalid
- LSN is within range
- Close file
- Stat file
- If mtime is from before the backup then signal possible corruption

If the checksum is invalid and the LSN isn't in range, then signal

In general, however, I don't like the idea of reaching into PG and
asking PG for this page.

> Here I was more worried that the window might get *too long* if tables
> are large :)

I'm not sure that there's really a 'too long' possibility here.

> The risk is certainly that you get a torn page *again* on the second
> read. It could be the same torn page (if it hasn't changed), but you
> can detect that (by the fact that it hasn't actually changed) and
> possibly do a short delay before trying again if it gets that far.

I'm really not a fan of introducing these delays in the hopes that
they'll work..

> That could happen if the process is too quick. It could also be that
> you are unlucky and that you hit a *new* write, and you were so
> unlucky that both times it happened to hit exactly when you were
> reading the page the next time. I'm not sure the chance of that
> happening is even big enough we have to care about it, though?

If there's actually a new write, surely the LSN would be new? At the
least, it wouldn't be the same LSN as the first read that picked up a
torn page.

In general though, I agree, we are getting to the point here where the
chances of missing something with this approach seems extremely slim. I
do still like the idea of doing better by actually scanning the WAL but
at least for now, this is far better than what we have today while not
introducing a huge amount of additional code or complexity.



In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Li Japin 2020-11-27 16:21:27 Re: Printing LSN made easy
Previous Message Heikki Linnakangas 2020-11-27 15:55:25 Re: parallel distinct union and aggregate support patch