Re: basebackup checksum verification

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, David Steele <david(at)pgmasters(dot)net>
Subject: Re: basebackup checksum verification
Date: 2019-03-26 23:22:03
Message-ID: 20190326232203.GD6197@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* Andres Freund (andres(at)anarazel(dot)de) wrote:
> As detailed in
> https://postgr.es/m/20190319200050.ncuxejradurjakdc%40alap3.anarazel.de
> the way the backend's basebackup checksum verification works makes its
> error detection capabilities very dubious.

I disagree that it's 'very dubious', even with your analysis. I thought
Robert's response was generally good, pointing out that we're talking
about this being an issue only if the corruption happens to land in a
certain set of bytes. That said, I'm happy to see improvements in this
area, but I flatly reject the notion that we must be perfect here- our
checksums themselves aren't perfect at catching corruption either.

> I think we need to fix this before the next set of backbranch releases,
> or at the very least add a big fat warning that the feature isn't doing
> much.

I disagree about this level of urgency, but if you have a decent idea
about how to improve the situation, I'm fully in support of it.

> The more I think about it, the less convinced I am of the method to
> avoid the torn page problem using LSNs. To make e.g. the PageIsNew()
> check correct, we need to verify that the whole page is zeroes - but we
> can't use the LSN for that, as it's not on the page. But there very well
> could be a torn page issue with only the second half of the page being
> written back (with the default 8kb pages that can trivially happen as
> the kernel's pagecache commonly uses 4kb granularity).
>
> I basically think it needs to work like this:
>
> 1) Perform the entire set of PageIsVerified() checks *without*
> previously checking the page's LSN, but don't error out.

Performing the PageIsVerified() checks seems reasonable; I don't see any
downside to doing that, so if you'd like to add them, sure, go for it.
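
Just to make that concrete, here's a rough sketch of what I'd expect
the per-block loop to end up doing- the helper name and variables below
are made up for illustration and aren't the existing basebackup.c code,
but PageIsVerified() itself is the real check (header sanity, the
checksum when enabled, and all-zero pages), and the point is to run it
on every block and keep going with a WARNING rather than erroring out:

#include "postgres.h"

#include "storage/bufpage.h"

/*
 * Illustrative helper, not existing basebackup.c code: verify every
 * block of a chunk just read from a relation file, with no LSN-based
 * skipping.  Returns the number of blocks that failed PageIsVerified().
 */
static int
verify_buffer_pages(char *buf, int nbytes, BlockNumber first_blkno,
                    const char *filename)
{
    int         failures = 0;

    for (int i = 0; i < nbytes / BLCKSZ; i++)
    {
        Page        page = (Page) (buf + (Size) i * BLCKSZ);
        BlockNumber blkno = first_blkno + i;

        /*
         * PageIsVerified() checks header sanity and the checksum (when
         * data checksums are enabled) and accepts all-zero pages; it
         * may itself emit a WARNING on a checksum mismatch.
         */
        if (!PageIsVerified(page, blkno))
        {
            failures++;
            ereport(WARNING,
                    (errmsg("page verification failed in file \"%s\", block %u",
                            filename, blkno)));
            /* keep going- the backup should continue despite corruption */
        }
    }

    return failures;
}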

> 2) If 1) errored out, ensure that that's because the backend is
> currently writing out the page. That basically requires doing what
> BufferAlloc() does. So I think we'd need to end up with a function
> like:
>
> LockoutBackendWrites():
> buf = ReadBufferWithoutRelcache(relfilenode);

This is going to cause the page to be pulled into shared buffers if it
isn't already there, isn't it? That seems less than ideal, and isn't it
going to end up doing exactly the same PageIsVerified() call anyway- and
what happens when that fails? You're going to get an ereport(ERROR) and
long-jump out of here... Depending on the other code, maybe that's
something you can manage, but it seems rather tricky to me. I do think,
as was discussed extensively previously, that the backup *should*
continue even in the face of corruption, but warnings should be issued
to notify the user of the problems found.
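
To illustrate why I say "tricky"- the usual way to keep going past that
ERROR would be to wrap the re-check in PG_TRY()/PG_CATCH() and downgrade
it to a WARNING, roughly as below. recheck_block_via_bufmgr() is a
made-up stand-in for the ReadBufferWithoutRelcache()-based re-read, and
note the caveat in the comment: catching an arbitrary ERROR without a
subtransaction gives no guarantee that pins, locks and the like were
cleaned up, which is exactly the kind of thing that makes me nervous
about this approach.

#include "postgres.h"

#include "storage/block.h"
#include "storage/relfilenode.h"

/* hypothetical stand-in for the bufmgr-based re-read and re-verify */
extern void recheck_block_via_bufmgr(RelFileNode rnode, BlockNumber blkno);

/*
 * Sketch only: run the re-check and downgrade any ERROR it raises to a
 * WARNING so the backup can continue.  Catching an arbitrary ERROR like
 * this, without a subtransaction, is not generally safe- resources held
 * by the failed code (buffer pins, lwlocks, ...) aren't necessarily
 * released- which is the trickiness being pointed out above.
 */
static bool
recheck_block_report_warning(RelFileNode rnode, BlockNumber blkno)
{
    MemoryContext oldcontext = CurrentMemoryContext;
    bool        ok = true;

    PG_TRY();
    {
        recheck_block_via_bufmgr(rnode, blkno);
    }
    PG_CATCH();
    {
        ErrorData  *edata;

        /* copy the error out of ErrorContext, then clear it */
        MemoryContextSwitchTo(oldcontext);
        edata = CopyErrorData();
        FlushErrorState();

        ereport(WARNING,
                (errmsg("verification re-check failed for block %u: %s",
                        blkno, edata->message)));
        FreeErrorData(edata);
        ok = false;
    }
    PG_END_TRY();

    return ok;
}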

> LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
> /*
> * Reread page from OS, and recheck. This needs to happen while
> * the IO lock prevents rereading from the OS. Note that we do
> * not want to rely on the buffer contents here, as that could
> * be very old cache contents.
> */
> perform_checksum_check(relfilenode, ERROR);
>
> LWLockRelease(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
> ReleaseBuffer(buf);

I don't particularly like having to lock pages in this way while
performing this check, especially since it potentially means pulling
the page into shared buffers.
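
For reference, my reading of that sequence in terms of the public
bufmgr API is roughly the following (LWLockRelease() takes just the
lock, and BufferDescriptorGetContentLock() wants a BufferDesc rather
than a Buffer, so I've written it with LockBuffer() instead).
recheck_page_from_os() is a made-up helper that would re-read the block
from the kernel and re-run the verification, deliberately not trusting
the copy in shared buffers since that could be arbitrarily old:

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/relfilenode.h"

/* hypothetical: re-read the block from the OS and re-run the checks */
extern void recheck_page_from_os(RelFileNode rnode, ForkNumber forknum,
                                 BlockNumber blkno);

static void
lockout_and_recheck(RelFileNode rnode, ForkNumber forknum, BlockNumber blkno)
{
    Buffer      buf;

    /* pulls the page into shared buffers if it isn't there already */
    buf = ReadBufferWithoutRelcache(rnode, forknum, blkno, RBM_NORMAL, NULL);

    /* block concurrent writers of this page while we re-verify */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    recheck_page_from_os(rnode, forknum, blkno);

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);
}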

This also isn't the only way to deal with the possibility that the LSN
itself is corrupted. There are at least two other ways we could improve
the situation here. First, we could keep track of the highest LSN seen
among skipped pages, perhaps on a per-file basis, compare that to the
end-of-backup LSN, and issue a warning (or perform a re-check, or do
something else) if we discover that a skipped page's LSN was later than
the end-of-backup LSN. That's not perfect, but it's certainly a good
improvement over what we have today. The other approach would be to
track all of the pages which were skipped and then compare them to the
pages in the WAL archived during the backup, making sure that every
page which failed its checksum shows up in the WAL. That would allow us
to confirm that the page really was being modified and will never be
used in the state we saw it in, since it'll be overwritten by WAL
replay, and therefore we don't have to worry about the LSN or the page
itself being corrupt. Of course, that requires tracking all of the
pages modified by WAL for the duration of the backup, plus all of the
pages which failed checksum and/or other validation, and then
performing the cross-check. That seems like a fair bit of work, but I'm
not sure it's ultimately avoidable.
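
A sketch of the first of those approaches, just to show how little
bookkeeping it would actually take- the struct and function names are
made up, and the end-of-backup LSN would be whatever do_pg_stop_backup()
hands back:

#include "postgres.h"

#include "access/xlogdefs.h"

/*
 * Illustrative bookkeeping, not existing basebackup.c code: remember
 * the highest LSN seen among pages skipped because their LSN looked
 * newer than the backup start point, then compare it against the
 * end-of-backup LSN.  A skipped page claiming an LSN beyond the end of
 * the backup cannot be explained by concurrent writes during the
 * backup, so its LSN (and likely the page) is suspect.
 */
typedef struct SkippedPageStats
{
    XLogRecPtr  max_skipped_lsn;    /* highest LSN among skipped pages */
    int64       skipped_pages;      /* how many pages were skipped */
} SkippedPageStats;

static void
note_skipped_page(SkippedPageStats *stats, XLogRecPtr page_lsn)
{
    if (page_lsn > stats->max_skipped_lsn)
        stats->max_skipped_lsn = page_lsn;
    stats->skipped_pages++;
}

static void
check_skipped_pages(const SkippedPageStats *stats, const char *filename,
                    XLogRecPtr backup_end_lsn)
{
    if (stats->skipped_pages > 0 &&
        stats->max_skipped_lsn > backup_end_lsn)
        ereport(WARNING,
                (errmsg("file \"%s\": a page skipped during checksum "
                        "verification has LSN %X/%X beyond the backup end LSN %X/%X",
                        filename,
                        (uint32) (stats->max_skipped_lsn >> 32),
                        (uint32) stats->max_skipped_lsn,
                        (uint32) (backup_end_lsn >> 32),
                        (uint32) backup_end_lsn)));
}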

I'm happy with incremental improvements in this area though, and just
checking that the LSN of a skipped page isn't completely insane would
definitely be a good improvement to begin with and might be simple
enough to back-patch. I don't think back-patching changes like those
proposed here is a good idea, however. I don't have any problem with
adding documentation to explain what's being done, with appropriate
caveats about how it might not catch all types of corruption (we should
do the same for the checksum feature itself, if we don't already have
such caveats...).
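
As a strawman for that simple LSN sanity check- compare a skipped
page's LSN against the current WAL insert position, since no legitimate
writer can stamp a page with an LSN from the future. GetInsertRecPtr()
is only an approximation of the insert position, so this is a coarse
check rather than a precise bound, and the function below is made up
for illustration:

#include "postgres.h"

#include "access/xlog.h"
#include "storage/bufpage.h"

/*
 * Illustrative only: a skipped page whose LSN is beyond anything the
 * server could have generated so far presumably has a corrupt LSN and
 * should be warned about rather than silently skipped.
 * GetInsertRecPtr() can lag the true insert position slightly, so
 * treat this as a coarse sanity check, not an exact bound.
 */
static bool
skipped_page_lsn_is_sane(Page page)
{
    XLogRecPtr  page_lsn = PageGetLSN(page);

    return page_lsn <= GetInsertRecPtr();
}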

Thanks!

Stephen
