Re: Online verification of checksums

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: Online verification of checksums
Date: 2018-09-17 18:09:45
Message-ID: CAOuzzgq0zxW7-26bKz=2ER8n34Y0iVG2gmB78pOfg0wACKM9Vw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings,

On Mon, Sep 17, 2018 at 13:38 Michael Banck <michael(dot)banck(at)credativ(dot)de>
wrote:

> so, trying some intermediate summary here, sorry for (also) top-posting:
>
> 1. the basebackup checksum verification logic only checks pages not
> changed since the checkpoint, which makes sense for the basebackup.

Right. I’m tending towards the idea that this also be adopted for
pg_verify_checksums.

2. However, it would be desirable to go further for pg_verify_checksums
> and (re-)check all pages.

Maybe. I’m not entirely convinced that it’s all that useful.

3. pg_verify_checksums should read the checkpoint LSN on startup and
> compare the page LSN against it on re-read, and discard pages which have
> checksum failures but are new. (Maybe it should read new checkpoint LSNs
> as they come in during its runtime as well? See below).

I’m not sure that we really need to but I’m not against it either- but in
that case you’re definitely going to see checksum failures on torn pages.

4. The pg_sleep should go.

I know that pgbackrest does not have a sleep currently and we’ve not yet
seen or been able to reproduce this case where, on a reread, we still see
an older LSN, but we check the LSN first also. If it’s possible that the
LSN still hasn’t changed on the reread then maybe we do need to have a
sleep to force ourselves off CPU to allow the other process to finish
writing, or maybe finish the file and come back around to these pages
later, but we have yet to see this behavior in the wild anywhere, nor have
we been able to reproduce it.

5. There seems to be no consensus on whether the number of skipped pages
> should be summarized at the end.

I agree with printing the number of skipped pages, that does seem like a
nice to have. I don’t know that actually printing the pages themselves is
all that useful though.

Further comments:
>
> Am Montag, den 17.09.2018, 19:19 +0200 schrieb Tomas Vondra:
> > On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > > * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> > > > On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > > > Without the checkpoint that's not guaranteed, and simply re-reading
> the
> > > > page and rechecking it vs. the first read does not help:
> > > >
> > > > 1) write the first 512B of the page (sector), which includes the LSN
> > > >
> > > > 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> > > >
> > > > 3) the checksum verification fails
> > > >
> > > > 4) read the page again (possibly reading a bit more new data)
> > > >
> > > > 5) the LSN did not change compared to the first read, yet the
> checksum
> > > > still fails
> > >
> > > So, I agree with all of the above though I've found it to be extremely
> > > rare to get a single read which you've managed to catch part-way
> through
> > > a write, getting multiple of them over a period of time strikes me as
> > > even more unlikely. Still, if we can come up with a solution to solve
> > > all of this, great, but I'm not sure that I'm hearing one.
> >
> > I don't recall claiming catching many such torn pages - I'm sure it's
> > not very common in most workloads. But I suspect constructing workloads
> > hitting them regularly is not very difficult either (something with a
> > lot of churn in shared buffers should do the trick).
> >
> > > > > Sure, because we don't care about it any longer- that page isn't
> > > > > interesting because the WAL will replay over it. IIRC it actually
> goes
> > > > > something like: check the checksum, if it failed then check if the
> LSN
> > > > > is greater than the checkpoint (of the backup start..), if not,
> then
> > > > > re-read, if the LSN is now newer than the checkpoint then skip, if
> the
> > > > > LSN is the same then throw an error.
> > > >
> > > > Nope, we only verify the checksum if it's LSN precedes the
> checkpoint:
> > > >
> > > >
> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> > >
> > > That seems like it's leaving something on the table, but, to be fair,
> we
> > > know that all of those pages should be rewritten by WAL anyway so they
> > > aren't all that interesting to us, particularly in the basebackup case.
> >
> > Yep.
>
> Right, see point 1 above.
>
> > > > > I actually tend to disagree with you that, for this purpose, it's
> > > > > actually necessary to check against the checkpoint LSN- if the LSN
> > > > > changed and everything is operating correctly then the new LSN
> must be
> > > > > more recent than the last checkpoint location or things are broken
> > > > > badly.
> > > >
> > > > I don't follow. Are you suggesting we don't need the checkpoint LSN?
> > > >
> > > > I'm pretty sure that's not the case. The thing is - the LSN may not
> > > > change between the two reads, but that's not a guarantee the page was
> > > > not torn. The example I posted earlier in this message illustrates
> that.
> > >
> > > I agree that there's some risk there, but it's certainly much less
> > > likely.
> >
> > Well. If we're going to report a checksum failure, we better be sure it
> > actually is a broken page. I don't want users to start chasing bogus
> > data corruption issues.
>
> I agree.
>
> > > > > Now, that said, I do think it's a good *idea* to check against the
> > > > > checkpoint LSN (presuming this is for online checking of
> checksums- for
> > > > > basebackup, we could just check against the backup-start LSN as
> anything
> > > > > after that point will be rewritten by WAL anyway). The reason
> that I
> > > > > think it's a good idea to check against the checkpoint LSN is that
> we'd
> > > > > want to throw a big warning if the kernel is just feeding us random
> > > > > garbage on reads and only finding a difference between two reads
> isn't
> > > > > really doing any kind of validation, whereas checking against the
> > > > > checkpoint-LSN would at least give us some idea that the value
> being
> > > > > read isn't completely ridiculous.
>
> Are you suggesting here that we always check against the current
> checkpoint, or is checking against the checkpoint that we saw at startup
> enough? I think re-reading pg_control all the time might be more
> errorprone that what we could get from this, so I would prefer not to do
> this.

I don’t follow why rereading pg_control would be error-prone. That said, I
don’t have a particularly strong opinion either way on this.

> > > > When it comes to if the pg_sleep() is necessary or not, I have to
> admit
> > > > > to being unsure about that.. I could see how it might be but it
> seems a
> > > > > bit surprising- I'd probably want to see exactly what the page was
> at
> > > > > the time of the failure and at the time of the second (no-sleep)
> re-read
> > > > > and then after a delay and convince myself that it was just an
> unlucky
> > > > > case of being scheduled in twice to read that page before the
> process
> > > > > writing it out got a chance to finish the write.
> > > >
> > > > I think the pg_sleep() is a pretty strong sign there's something
> broken.
> > > > At the very least, it's likely to misbehave on machines with
> different
> > > > timings, machines under memory and/or memory pressure, etc.
>
> I swapped out the pg_sleep earlier today for the check-against-
> checkpoint-LSN-on-reread, and that seems to work just as fine, at least
> in the tests I ran.

Ok, this sounds like you were probably seeing normal forward torn pages,
and we have certainly seen that before.

> > If we assume that what you've outlined above is a serious enough issue
> > > that we have to address it, and do so without a pg_sleep(), then I
> think
> > > we have to bake into this a way for the process to check with PG as to
> > > what the page's current LSN is, in shared buffers, because that's the
> > > only place where we've got the locking required to ensure that we don't
> > > end up with a read of a partially written page, and I'm really not
> > > entirely convinced that we need to go to that level. It'd certainly
> add
> > > a huge amount of additional complexity for what appears to be a quite
> > > unlikely gain.
> > >
> > > I'll chat w/ David shortly about this again though and get his thoughts
> > > on it. This is certainly an area we've spent time thinking about but
> > > are obviously also open to finding a better solution.
> >
> > Why not to simply look at the last checkpoint LSN and use that the same
> > way basebackup does? AFAICS that should make the pg_sleep() unnecessary.
>
> Right.

This is fine if you know the kernel will always write the first page first,
or you accept that a reread of a page which isn’t valid will always result
in seeing a completely updated page.

We’ve made the assumption that a reread on a failure where the LSN on the
first read was older than the backup-start LSN will give us an updated
first-half of the page which we then check the LSN, of, but we have yet to
prove that this is actually possible.

Thanks!

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2018-09-17 18:27:23 Re: Stored procedures and out parameters
Previous Message Michael Banck 2018-09-17 17:38:20 Re: Online verification of checksums