From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-03 20:29:18
Message-ID: 3d9a49ce-f7ba-fbbb-409a-4d3d395493e8@2ndquadrant.com
Lists: pgsql-hackers

Hi,

The patch mostly copies the verification / retry logic from
basebackup.c, but I think it omits a rather important detail that makes
it incorrect in the presence of concurrent writes.

The very first thing basebackup does is this:

startptr = do_pg_start_backup(...);

i.e. it waits for a checkpoint, remembering the LSN. And then when
checking a page it does this:

if (!PageIsNew(page) && PageGetLSN(page) < startptr)
{
    ... verify the page checksum
}

Obviously, pg_verify_checksums can't do that easily because it's
supposed to run from outside the database instance. But the startptr
detail is pretty important because it supports this retry reasoning:

/*
 * Retry the block on the first failure. It's
 * possible that we read the first 4K page of the
 * block just before postgres updated the entire block
 * so it ends up looking torn to us. We only need to
 * retry once because the LSN should be updated to
 * something we can ignore on the next pass. If the
 * error happens again then it is a true validation
 * failure.
 */
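
For context, the surrounding basebackup.c logic amounts to roughly this
(a simplified paraphrase, not the literal source - read_page,
checksum_matches and report_checksum_failure are stand-ins for what the
real code does inline):

for (attempt = 0; attempt < 2; attempt++)
{
    read_page(fd, blkno, page);         /* read one 8kB block */

    /*
     * Pages stamped with an LSN at or past the backup start point
     * will be overwritten by WAL replay anyway, so their on-disk
     * state (torn or not) is irrelevant - skip the verification.
     */
    if (PageIsNew(page) || PageGetLSN(page) >= startptr)
        break;

    if (checksum_matches(page, blkno))
        break;                          /* checksum OK */

    if (attempt == 1)
        report_checksum_failure(blkno); /* failed twice - report it */

    /* First failure - we may have raced a concurrent write, retry. */
}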

Imagine the 8kB page as two 4kB pages, with the initial state being
[A1,A2] and another process overwriting it with [B1,B2]. If you read
the 8kB page, what states can you see?

I don't think POSIX provides any guarantees about atomicity of write
calls (and even if it does, the filesystems on Linux don't seem to). So
you may observe either [A1,B2] or [B1,A2], or various inconsistent
mixes of the two versions, depending on timing. Well, torn pages ...

Pretty much the only thing you can rely on is that when one process does

write([B1,B2])

the other process may first read [A1,B2], but the next read will return
[B1,B2] (or possibly newer data, if there was another write). It will
not read the "stale" A1 again.

basebackup relies on this somewhat implicitly - on the retry it'll
notice the LSN has changed (thanks to the startptr check) and skip the
page entirely. This is pretty important, because the new page might be
torn in some other way.

pg_verify_checksums apparently omits this skip logic - on the retry it
simply re-reads the page, verifies the checksum and reports an error.
That is broken, because the newly read page might be torn again by a
concurrent write.

So IMHO this should do something similar to basebackup - check the page
LSN, and if it changed then skip the page.

I'm afraid this requires using the last checkpoint LSN, the way startptr
is used in basebackup. In particular we can't simply remember LSN from
the first read, because we might actually read [B1,A2] on the first try,
and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
torn in various other ways, not necessarily at the 4kB boundary - it
might be torn right after the LSN, for example).
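
In code form, roughly this (a sketch only - same hypothetical helpers
as above; get_controlfile() is the frontend helper that already exists
in src/common/controldata_utils.c, field and signature from memory):

#include "common/controldata_utils.h"

/* at startup: read the redo LSN of the last checkpoint from pg_control */
bool        crc_ok;
ControlFileData *control = get_controlfile(DataDir, progname, &crc_ok);
XLogRecPtr  ckpt_lsn = control->checkPointCopy.redo;

/* per-page loop, with the skip applied on both passes */
for (attempt = 0; attempt < 2; attempt++)
{
    read_page(fd, blkno, page);

    /*
     * A page with an LSN past the last checkpoint may be legitimately
     * torn on disk - recovery will restore it from a full-page image.
     * On the retry, this is what prevents re-verifying a freshly
     * overwritten (and possibly again torn) page.
     */
    if (PageIsNew(page) || PageGetLSN(page) >= ckpt_lsn)
        break;

    if (checksum_matches(page, blkno))
        break;

    if (attempt == 1)
        report_checksum_failure(blkno);
}

And since the cluster keeps checkpointing while the tool runs, the LSN
read at startup only errs on the side of skipping - pages modified
later get skipped, never falsely reported.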

FWIW I also don't understand the purpose of pg_sleep(); it does not
seem to protect against anything, really.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
