Re: Online verification of checksums

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: Online verification of checksums
Date: 2019-03-19 19:27:55
Message-ID: CAOuzzgrhZJ6kBaK7v+Xra=q7_XFbtXyQCBFrvcrsCqZZ20WFTw@mail.gmail.com
Lists: pgsql-hackers

Greetings,

On Tue, Mar 19, 2019 at 23:59 Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> > > It's torn pages that I am concerned about - the server is writing and
> > > we are reading, and we get a mix of old and new content. We have been
> > > quite diligent about protecting ourselves from such risks elsewhere,
> > > and checksum verification should not be held to any lesser standard.
> >
> > If we see a checksum failure on an otherwise correctly read block in
> > online mode, we retry the block on the theory that we might have read a
> > torn page. If the checksum verification still fails, we compare its LSN
> > to the LSN of the current checkpoint and don't mind if it's newer. This
> > way, a torn page should not cause a false positive either way, I
> > think.
>
> False positives, no. But there's plenty of potential for false
> negatives. In many clusters a large fraction of the pages is going to
> be touched in most checkpoints.

How is it a false negative? The page was in the middle of being written:
if we crash, the page won't be used because WAL replay from the checkpoint
will overwrite it, and if we don't crash, it also won't be used until it's
been written out completely. I don't agree that this is in any way a false
negative; it's simply a page that happens to be in the middle of a file
that we can skip because it isn't going to be used in that state. It's not
as though the backend is going to see a checksum failure when it reads it.

Not only that, but checksum failures are much more likely to happen on
long-dormant data, not on data that's actively being written out and
therefore is still in the Linux filesystem cache and hasn't even hit
actual storage yet.

> > If it is a genuine storage failure we will see it in the next
> > pg_checksums run as its LSN will be older than the checkpoint.
>
> Well, but also, by that time it might be too late to recover things. Or
> it might be a backup that you just made, that you later want to recover
> from, ...

If it's a backup you just made, then that page is going to be in the WAL
and the torn page on disk isn't going to be used, so how is this an issue?
This is why we have WAL: to deal with torn pages.

> > The basebackup checksum verification works in the same way.
>
> Shouldn't have been merged that way.

I have a hard time not finding this offensive. These issues were
considered, discussed, and well thought out, with the result being
committed after agreement.

Do you have any example cases where the code in pg_basebackup has resulted
in either a false positive or a false negative? Any case which can be
shown to result in either?

If not, then I think we need to stop this: if we can't trust that a torn
page won't actually be used in its torn state, then it seems likely that
our entire WAL system is broken, that we can't trust the way we do backups
either, and that we'd have to rewrite all of that to take precautions to
lock pages while doing a backup.

Thanks!

Stephen
