Re: Online verification of checksums

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-17 17:00:31
Message-ID: ab83e470-4077-a6c1-2dbc-083d7df274ac@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 09/17/2018 06:42 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
>> On 09/17/2018 04:46 PM, Stephen Frost wrote:
>>> * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
>>>> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>>>>> Obviously, pg_verify_checksums can't do that easily because it's
>>>>> supposed to run from outside the database instance.
>>>>
>>>> It reads pg_control anyway, so couldn't we just take
>>>> ControlFile->checkPoint?
>>>>
>>>> Other than that, basebackup.c seems to only look at pages which haven't
>>>> been changed since the backup starting checkpoint (see above if
>>>> statement). That's reasonable for backups, but is it just as reasonable
>>>> for online verification?
>>>
>>> Right, basebackup doesn't need to look at other pages.
>>>
>>>>> The pg_verify_checksum apparently ignores this skip logic, because on
>>>>> the retry it simply re-reads the page again, verifies the checksum and
>>>>> reports an error. Which is broken, because the newly read page might be
>>>>> torn again due to a concurrent write.
>>>>
>>>> Well, ok.
>>>
>>> The newly read page will have an updated LSN on the re-read though, in
>>> which case basebackup can know that what happened was a rewrite of the
>>> page, and it no longer has to care about the page and can skip it.
>>>
>>> I haven't looked, but if basebackup isn't checking the LSN again for the
>>> newly read page then that'd be broken, but I believe it does (at least,
>>> that's the algorithm we came up with for pgBackRest, and I know David
>>> shared that when the basebackup code was being written).
>>
>> Yes, basebackup does check the LSN on re-read, and skips the page if it
>> changed on re-read (because it eliminates the consistency guarantees
>> provided by the checkpoint).
>
> Ok, good, though I'm not sure what you mean by 'eliminates the
> consistency guarantees provided by the checkpoint'. The point is that
> the page will be in the WAL and the WAL will be replayed during the
> restore of the backup.
>

The checkpoint guarantees that the whole page was written and flushed to
disk with an LSN preceding the checkpoint LSN. So when you read a page with
such an LSN, you know the whole write already completed and the read won't
return data from before that LSN.

Without the checkpoint that's not guaranteed, and simply re-reading the
page and rechecking it vs. the first read does not help:

1) write the first 512B of the page (sector), which includes the LSN

2) read the whole page, which will be a mix [new 512B, ... old ... ]

3) the checksum verification fails

4) read the page again (possibly reading a bit more new data)

5) the LSN did not change compared to the first read, yet the checksum
still fails
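
To make that sequence concrete, here's a tiny standalone simulation
(illustrative only; simplified page layout with the LSN in the first
sector, and a toy checksum standing in for pg_checksum_page):

    /* Illustrative simulation of the torn read above; not PostgreSQL code. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 8192
    #define SECTOR     512

    static uint32_t toy_checksum(const uint8_t *page)   /* stand-in only */
    {
        uint32_t sum = 0;
        for (int i = 0; i < PAGE_SIZE; i++)
            sum = sum * 31 + page[i];
        return sum;
    }

    int main(void)
    {
        uint8_t  old_page[PAGE_SIZE], new_page[PAGE_SIZE], on_disk[PAGE_SIZE];
        uint64_t old_lsn = 100, new_lsn = 200, checkpoint_lsn = 150;

        memset(old_page, 0xAA, PAGE_SIZE);
        memset(new_page, 0xBB, PAGE_SIZE);
        memcpy(old_page, &old_lsn, sizeof(old_lsn));   /* LSN lives in sector 0 */
        memcpy(new_page, &new_lsn, sizeof(new_lsn));

        /* 1) only the first sector of the new version reaches the disk */
        memcpy(on_disk, old_page, PAGE_SIZE);
        memcpy(on_disk, new_page, SECTOR);

        /* 2)-5) both reads see the same torn mix: new LSN, old remainder */
        uint64_t lsn_read1, lsn_read2;
        memcpy(&lsn_read1, on_disk, sizeof(lsn_read1));
        memcpy(&lsn_read2, on_disk, sizeof(lsn_read2));

        printf("LSN changed between reads? %s\n",
               lsn_read1 == lsn_read2 ? "no" : "yes");
        printf("checksum still fails?      %s\n",
               toy_checksum(on_disk) != toy_checksum(new_page) ? "yes" : "no");
        printf("LSN > checkpoint LSN?      %s  (so the page can be skipped)\n",
               lsn_read1 > checkpoint_lsn ? "yes" : "no");
        return 0;
    }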

>>>> So how about we do check every page, but if one fails on retry, and the
>>>> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
>>>
>>> I thought that's what basebackup did- if it doesn't do that today, then
>>> it really should.
>>
>> The crucial distinction here is that the trick is not in comparing LSNs
>> from the two page reads, but comparing it to the checkpoint LSN. If it's
>> greater, the page may be torn or broken, and there's no way to know
>> which case it is - so basebackup simply skips it.
>
> Sure, because we don't care about it any longer- that page isn't
> interesting because the WAL will replay over it. IIRC it actually goes
> something like: check the checksum, if it failed then check if the LSN
> is greater than the checkpoint (of the backup start..), if not, then
> re-read, if the LSN is now newer than the checkpoint then skip, if the
> LSN is the same then throw an error.
>

Nope, we only verify the checksum if its LSN precedes the checkpoint:

https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
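
For reference, the per-page logic is roughly this (just a sketch, not the
actual basebackup.c code; read_page() is an assumed helper, and for
pg_verify_checksums the checkpoint LSN would come from
ControlFile->checkPoint as Michael suggested):

    /* Simplified sketch, not the actual basebackup.c code. */
    #include "postgres_fe.h"
    #include "storage/bufpage.h"
    #include "storage/checksum_impl.h"

    extern void read_page(BlockNumber blkno, char *buf);   /* assumed helper */

    static bool
    verify_page(BlockNumber blkno, XLogRecPtr checkpoint_lsn)
    {
        char        buf[BLCKSZ];
        PageHeader  hdr = (PageHeader) buf;

        read_page(blkno, buf);

        /*
         * Pages with an LSN past the checkpoint may be torn by a concurrent
         * write (and will be overwritten by WAL replay anyway), so they are
         * not verified at all.
         */
        if (PageGetLSN((Page) buf) > checkpoint_lsn)
            return true;                    /* skipped */

        if (pg_checksum_page(buf, blkno) == hdr->pd_checksum)
            return true;                    /* checksum ok */

        /* Failure: re-read once; a concurrent write moves the LSN forward. */
        read_page(blkno, buf);
        if (PageGetLSN((Page) buf) > checkpoint_lsn)
            return true;                    /* rewritten concurrently, skip */

        fprintf(stderr, "checksum mismatch in block %u\n", blkno);
        return false;
    }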

>>>> In any case, if we decide we really should skip the page if it is newer
>>>> than the checkpoint, I think it makes sense to track those skipped pages
>>>> and print their number out at the end, if there are any.
>>>
>>> Not sure what the point of this is. If we wanted to really do something
>>> to cross-check here, we'd track the pages that were skipped and then
>>> look through the WAL to make sure that they're there. That's something
>>> we've talked about doing with pgBackRest, but don't currently.
>>
>> I agree simply printing the page numbers seems rather useless. What we
>> could do is remember which pages we skipped and then try checking them
>> after another checkpoint. Or something like that.
>
> I'm still not sure I'm seeing the point of that. They're still going to
> almost certainly be in the kernel cache. The reason for checking
> against the WAL would be to detect errors in PG where we aren't putting
> a page into the WAL when it really should be, or something similar,
> which seems like it at least could be useful.
>
> Maybe to put it another way- there's very little point in checking the
> checksum of a page which we know must be re-written during recovery to
> get to a consistent point. I don't think it hurts in the general case,
> but I wouldn't write a lot of code which then needs to be tested to
> handle it. I also don't think that we really need to make
> pg_verify_checksum spend lots of extra cycles trying to verify that
> *every* page had its checksum validated when we know that lots of pages
> are going to be in memory marked dirty and our checking of them will be
> ultimately pointless as they'll either be written out by the
> checkpointer or some other process, or we'll replay them from the WAL if
> we crash.
>

Yeah, I agree.

>>>>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>>>>> to protect against anything, really.
>>>>
>>>> Well, I've noticed that without it I get sporadic checksum failures on
>>>> reread, so I've added it to make them go away. It was certainly a
>>>> phenomenological decision that I am happy to trade for a better one.
>>>
>>> That then sounds like we really aren't re-checking the LSN, and we
>>> really should be, to avoid getting these sporadic checksum failures on
>>> reread..
>>
>> Again, it's not enough to check the LSN against the preceding read. We
>> need a checkpoint LSN or something like that.
>
> I actually tend to disagree with you that, for this purpose, it's
> actually necessary to check against the checkpoint LSN- if the LSN
> changed and everything is operating correctly then the new LSN must be
> more recent than the last checkpoint location or things are broken
> badly.
>

I don't follow. Are you suggesting we don't need the checkpoint LSN?

I'm pretty sure that's not the case. The thing is - the LSN may not
change between the two reads, but that's not a guarantee the page was
not torn. The example I posted earlier in this message illustrates that.

> Now, that said, I do think it's a good *idea* to check against the
> checkpoint LSN (presuming this is for online checking of checksums- for
> basebackup, we could just check against the backup-start LSN as anything
> after that point will be rewritten by WAL anyway). The reason that I
> think it's a good idea to check against the checkpoint LSN is that we'd
> want to throw a big warning if the kernel is just feeding us random
> garbage on reads and only finding a difference between two reads isn't
> really doing any kind of validation, whereas checking against the
> checkpoint-LSN would at least give us some idea that the value being
> read isn't completely ridiculous.
>
> When it comes to if the pg_sleep() is necessary or not, I have to admit
> to being unsure about that.. I could see how it might be but it seems a
> bit surprising- I'd probably want to see exactly what the page was at
> the time of the failure and at the time of the second (no-sleep) re-read
> and then after a delay and convince myself that it was just an unlucky
> case of being scheduled in twice to read that page before the process
> writing it out got a chance to finish the write.
>

I think the pg_sleep() is a pretty strong sign there's something broken.
At the very least, it's likely to misbehave on machines with different
timings, machines under memory pressure, etc.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
