basebackup checksum verification

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, David Steele <david(at)pgmasters(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: basebackup checksum verification
Date: 2019-03-26 17:08:20
Message-ID: 20190326170820.6sylklg7eh6uhabd@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

As detailed in
https://postgr.es/m/20190319200050.ncuxejradurjakdc%40alap3.anarazel.de
the way the backend's basebackup checksum verification works makes its
error detection capabilities very dubious.

I think we need to fix this before the next set of backbranch releases,
or at the very least add a big fat warning that the feature isn't doing
much.

The more I think about it, the less convinced I am of the method to
avoid the torn page problem using LSNs. To make e.g. the PageIsNew()
check correct, we need to verify that the whole page is zeroes - but we
can't use the LSN for that, as it's not on the page. But there very well
could be a torn page issue with only the second half of the page being
written back (with the default 8kb pages that can trivially happen as
the kernel's pagecache commonly uses 4kb granularity).

I basically think it needs to work like this:

1) Perform the entire set of PageIsVerified() checks *without*
previously checking the page's LSN, but don't error out.

2) If 1) errored out, ensure that that's because the backend is
currently writing out the page. That basically requires doing what
BufferAlloc() does. So I think we'd need to end up with a function
like:

LockoutBackendWrites():
buf = ReadBufferWithoutRelcache(relfilenode);
LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
/*
* Reread page from OS, and recheck. This needs to happen while
* the IO lock prevents rereading from the OS. Note that we do
* not want to rely on the buffer contents here, as that could
* be very old cache contents.
*/
perform_checksum_check(relfilenode, ERROR);

LWLockRelease(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
ReleaseBuffer(buf);

3) If 2) also failed, then we can be sure that the page is truly
corrupted.

Greetings,

Andres Freund

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2019-03-26 17:11:14 Re: pgsql: Get rid of backtracking in jsonpath_scan.l
Previous Message Alvaro Herrera 2019-03-26 16:53:02 Re: pgsql: Get rid of backtracking in jsonpath_scan.l