Re: Online verification of checksums

From: Andres Freund <andres(at)anarazel(dot)de>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: Online verification of checksums
Date: 2019-03-19 20:00:50
Message-ID: 20190319200050.ncuxejradurjakdc@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2019-03-20 03:27:55 +0800, Stephen Frost wrote:
> On Tue, Mar 19, 2019 at 23:59 Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > > Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> > > > It's torn pages that I am concerned about - the server is writing and
> > > > we are reading, and we get a mix of old and new content. We have been
> > > > quite diligent about protecting ourselves from such risks elsewhere,
> > > > and checksum verification should not be held to any lesser standard.
> > >
> > > If we see a checksum failure on an otherwise correctly read block in
> > > online mode, we retry the block on the theory that we might have read a
> > > torn page. If the checksum verification still fails, we compare its LSN
> > > to the LSN of the current checkpoint and don't mind if it's newer. This
> > > way, a torn page should not cause a false positive either way, I
> > > think.
> >
> > False positives, no. But there's plenty of potential for false
> > negatives. In plenty of clusters, a large fraction of the pages is
> > going to be touched in most checkpoints.
>
>
> How is it a false negative? The page was in the middle of being
> written,

You don't actually know that. It could just be random gunk in the LSN,
and this type of logic just ignores such failures as long as the random
gunk is above the system's LSN.

And the basebackup logic doesn't restrict the "ignore" to pages where
both the checksum failed and the LSN is between startptr and the current
insertion pointer - it skips verification for *any* page that has a
pd_upper != 0 and a pd_lsn > startptr. Given typical startptr values
(skewing heavily towards lower int64s), that means that random data is
more likely than not to pass this test.
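
To make that concrete, here's a minimal sketch (my paraphrase, not the
committed basebackup code; the function name is made up) of the
predicate as it effectively behaves:

/*
 * Sketch only: the checksum is compared iff this returns true.  Since
 * pd_lsn is just the first eight bytes of the page, a page overwritten
 * with random data carries an essentially uniform random 64-bit "LSN",
 * which is almost certainly above any realistic startptr - so the page
 * is silently skipped instead of being flagged as corrupt.
 */
static bool
would_verify_checksum(Page page, XLogRecPtr startptr)
{
    if (PageIsNew(page))                /* pd_upper == 0: no check at all */
        return false;
    if (PageGetLSN(page) >= startptr)   /* "modified during the backup" */
        return false;
    return true;
}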

As it stands, the logic seems to give more false confidence than
anything else.

> > The basebackup checksum verification works in the same way.
> >
> > Shouldn't have been merged that way.
>
>
> I have a hard time not finding this offensive. These issues were
> considered, discussed, and well thought out, with the result being
> committed after agreement.

Well, I don't know what to tell you. But:

    /*
     * Only check pages which have not been modified since the
     * start of the base backup. Otherwise, they might have been
     * written only halfway and the checksum would not be valid.
     * However, replaying WAL would reinstate the correct page in
     * this case. We also skip completely new pages, since they
     * don't have a checksum yet.
     */
    if (!PageIsNew(page) && PageGetLSN(page) < startptr)
    {

doesn't consider plenty of scenarios, as pointed out above. It'd be one
thing if the concerns I point out above had actually been commented upon
and weighed as not substantial enough (not that I know how they could
be). But...

> Do you have any example cases where the code in pg_basebackup has resulted
> in either a false positive or a false negative? Any case which can be
> shown to result in either?

CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 1000000) g(i);
SELECT pg_relation_size('corruptme');
postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || pg_relation_filepath('corruptme');
┌─────────────────────────────────────┐
│              ?column?               │
├─────────────────────────────────────┤
│ /srv/dev/pgdev-dev/base/13390/16384 │
└─────────────────────────────────────┘
(1 row)
dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc

Try a basebackup and see how many times it'll detect the corrupt
data. In the vast majority of cases you're going to see checksum
failures when reading the data for normal operation, but not when using
basebackup (or this new tool).
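
For intuition on "vast majority": a back-of-the-envelope calculation
(a standalone illustration; the 16TB-of-WAL figure for startptr is just
an assumed example):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    /* assumed example: the cluster has written 16TB of WAL ever,
     * so startptr = 2^44 */
    uint64_t    startptr = (uint64_t) 1 << 44;

    /* chance that a uniformly random pd_lsn lands below startptr,
     * i.e. that the overwritten page gets checksum-verified at all */
    double      p_detect = (double) startptr / 18446744073709551616.0;

    printf("P(corrupt page gets verified) ~= %g\n", p_detect); /* ~1e-06 */
    return 0;
}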

At the very very least this would need to:

a) check that the page is all zeroes if PageIsNew() (like
   PageIsVerified() does for the backend). That avoids missing cases
   where corruption just zeroed out the header, but not the whole page.
b) check that pd_lsn is between startptr and the insertion pointer. That
   avoids accepting just about all random data.
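
A rough sketch of both, with a made-up helper name; (a) mirrors the
all-zeroes test PageIsVerified() performs in the backend, (b) is the
LSN-range test:

static bool
failure_is_plausibly_concurrent_write(Page page, XLogRecPtr startptr,
                                      XLogRecPtr insertptr)
{
    if (PageIsNew(page))
    {
        /* (a) a "new" page must be entirely zero; a zeroed header in
         * front of non-zero data is corruption, not a fresh page */
        size_t     *pagebytes = (size_t *) page;
        int         i;

        for (i = 0; i < BLCKSZ / sizeof(size_t); i++)
            if (pagebytes[i] != 0)
                return false;
        return true;
    }

    /*
     * (b) only treat a checksum failure as a concurrent write when
     * pd_lsn falls in the narrow window between the backup's start and
     * the current insert position; random bytes will almost never land
     * there.
     */
    return PageGetLSN(page) >= startptr && PageGetLSN(page) <= insertptr;
}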

And that'd *still* be less stringent than what normal backends
check. And even that's not great (due to not noticing zeroed-out
data).

I fail to see how it's offensive to describe this as "shouldn't have
been merged that way".

Greetings,

Andres Freund
