Re: pgsql: Validate page level checksums in base backups

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pgsql: Validate page level checksums in base backups
Date: 2018-04-10 21:24:35
Message-ID: f23e92ec-118d-e6ea-0c81-876ad62a588a@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

Hi,

I think there's a bug in sendFile(). We do check checksums on all pages
that pass this LSN check:

/*
* Only check pages which have not been modified since the
* start of the base backup. Otherwise, they might have been
* written only halfway and the checksum would not be valid.
* However, replaying WAL would reinstate the correct page in
* this case.
*/
if (PageGetLSN(page) < startptr)
{
...
}

Now, imagine the page is new, i.e. all-zeroes. That means the LSN is 0/0
too, and we'll try to verify the checksum - but we actually do not set
checksums on empty pages.

So I think it should be something like this:

if ((!PageIsNew(page)) && (PageGetLSN(page) < startptr))
{
...
}

It might be worth verifying that the page is actually all-zeroes (and
not just with corrupted pd_upper value. Not sure it's worth it.

I've found this by fairly trivial stress testing - running pgbench and
pg_basebackup in a loop. It was failing pretty reliably (~75% of runs).
With the proposed change I see no further failures.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Tomas Vondra 2018-04-10 21:35:21 Re: pgsql: Validate page level checksums in base backups
Previous Message Tom Lane 2018-04-10 20:15:19 pgsql: Put back parallel-safety guards in plpython and src/test/regress

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2018-04-10 21:32:03 Re: [HACKERS] Runtime Partition Pruning
Previous Message Chapman Flack 2018-04-10 21:19:36 Re: lazy detoasting