Re: Enabling Checksums

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enabling Checksums
Date: 2012-11-12 04:55:19
Message-ID: 50A08137.5090200@2ndQuadrant.com
Lists: pgsql-hackers

On 11/11/12 2:56 PM, Jeff Davis wrote:
> We could have a separate utility, pg_checksums, that can
> alter the state and/or do an offline verification. And initdb would take
> an option that would start everything out fully protected with
> checksums.

Adding an initdb option to start out with everything checksummed seems
like an uncontroversially good first thing to have available. It seems like a
proper 9.3 target to aim at even if per-table upgrading gets bogged down
in details. I have an argument below that the area between initdb and
per-table upgrades is fundamentally uncertain and therefore not worth
chasing after, based on reasons you already started to outline. There's
not much useful middle ground there.
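
Just to make that target concrete, I'd expect the UI to be no more than
a single switch, something like this (the option spelling here is
hypothetical at this point):

  initdb --data-checksums -D $PGDATA

Everything created after that is born checksummed, and none of the
conversion problems below ever come up.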

Won't a pg_checksums program just grow until it looks like a limited
version of vacuum though? It's going to iterate over most of the table;
it needs the same cost controls as autovacuum (and to respect the load
of concurrent autovacuum work) to keep I/O under control; and those cost
control values might change if there's a SIGHUP to reload parameters.
It looks so much like vacuum that I think there needs to be a really
compelling reason to split it into something new. Why can't this be yet
another autovacuum worker that does its thing?
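
To make that concern concrete, here's a rough sketch of the throttling
any standalone pg_checksums would end up carrying around--essentially a
clone of vacuum_delay_point(). The function name is invented; the
VacuumCost* globals and ProcessConfigFile() are the existing backend
pieces a separate tool would have to wire together itself:

static volatile sig_atomic_t got_SIGHUP = false;

/* Hypothetical: re-creates what vacuum's delay point already does */
static void
checksum_delay_point(void)
{
    /* Absorb a SIGHUP so cost settings can change during a long run */
    if (got_SIGHUP)
    {
        got_SIGHUP = false;
        ProcessConfigFile(PGC_SIGHUP);
    }

    /* Sleep once enough I/O cost has accumulated, just like vacuum */
    if (VacuumCostActive && VacuumCostBalance >= VacuumCostLimit)
    {
        pg_usleep(VacuumCostDelay * 1000L);
        VacuumCostBalance = 0;
    }
}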

> In order to get to the fully-protected state, you still need to
> somehow make sure that all of the old data is checksummed.
>
> And the "fully protected" state is important in my opinion, because
> otherwise we aren't protected against corrupt page headers that say
> they have no checksum (even when it really should have a checksum).

I think it's useful to step back for a minute and consider the larger
uncertainty an existing relation has, which amplifies just how ugly this
situation is. The best guarantee I think online checksumming can offer
is to tell the user "after transaction id X, all new data in relation R
is known to be checksummed". Unless you do this at initdb time, any
conversion case is going to have the possibility that a page is
corrupted before you get to it--whether you're adding the checksum as
part of a "let's add them while we're writing anyway" page update or the
conversion tool is hitting it.

That's why I don't think anyone will find online conversion really
useful until they've done a full sweep updating the old pages. And if
you accept that, a flexible checksum upgrade utility--one that shares
cost accounting with concurrent autovacuum activity--becomes a must.

One of the really common cases I was expecting here is that conversions
are done by kicking off a slow background VACUUM CHECKSUM job that might
run in pieces. I was thinking of an approach like this:

-Initialize a last_checked_block value for each table
-Loop:
--Grab the next block after the last checked one
--When on the last block of the relation, grab an exclusive lock to
protect against race conditions with extension
--If it's marked as checksummed and the checksum matches, skip it
---Otherwise, add a checksum and write it out
--When that succeeds, update last_checked_block
--If that was the last block, save some state saying the whole table is
checksummed
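
In code form, that loop might look roughly like this. The
PD_HAS_CHECKSUM flag and the last_checked_block bookkeeping calls are
invented names for the state described above; the buffer manager and
lock manager calls are the real ones:

/* Hypothetical sketch of the VACUUM CHECKSUM sweep over one relation */
void
checksum_relation(Relation rel)
{
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
    BlockNumber blkno = get_last_checked_block(rel);    /* hypothetical */

    for (; blkno < nblocks; blkno++)
    {
        Buffer      buf;
        Page        page;

        vacuum_delay_point();   /* reuse vacuum's I/O cost throttling */

        /* Guard the last block against a concurrent extension */
        if (blkno == nblocks - 1)
            LockRelationForExtension(rel, ExclusiveLock);

        buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, NULL);
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        page = BufferGetPage(buf);

        /* Skip pages already flagged and matching; fix up the rest
         * (a real version would also need to WAL-log this change) */
        if (!(((PageHeader) page)->pd_flags & PD_HAS_CHECKSUM) ||
            !checksum_matches(page, blkno))             /* hypothetical */
        {
            ((PageHeader) page)->pd_flags |= PD_HAS_CHECKSUM;
            set_page_checksum(page, blkno);             /* hypothetical */
            MarkBufferDirty(buf);
        }

        UnlockReleaseBuffer(buf);
        if (blkno == nblocks - 1)
            UnlockRelationForExtension(rel, ExclusiveLock);

        set_last_checked_block(rel, blkno);             /* hypothetical */
    }

    mark_relation_checksummed(rel);                     /* hypothetical */
}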

With that logic, there is at least a forward moving pointer that removes
the uncertainty around whether pages have been updated or not. It will
keep going usefully if interrupted too. One obvious way this can
fail is if:

1) A late page in the relation is updated and a checksummed page written
2) The page is corrupted such that the "is this checksummed?" bits are
not consistent anymore, along with other damage to it
3) The conversion process gets to this page eventually
4) The corruption of (2) isn't detected

But I think that this possibility--that a page might get quietly
corrupted after being checked once, while the sweep of its relation is
still in progress--is both impossible to remove and a red herring. How do we
know that this page of the relation wasn't corrupted on disk before we
even started? We don't, and we can't.

The only guarantee I see that we can give for online upgrades is that
after a VACUUM CHECKSUM sweep is done, and every page is known to both
have a valid checksum on it and have its checksum bits set, *then* any
page that doesn't have both set bits and a matching checksum is garbage.
Until reaching that point, any old data is suspect. The idea of
operating in a "we'll convert on write but never convert old pages"
mode can't offer any useful guarantees about data integrity that I can
see. As you say, you don't ever gain the ability to tell pages that
were checksummed but have since been corrupted from ones that were
corrupt all along in that path.
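
Said as code, this is the read-side test we can honestly offer, again
using the invented names from the sketch above:

/* Hypothetical: returns false only for provable corruption */
static bool
page_checks_out(Relation rel, Page page, BlockNumber blkno)
{
    bool    flagged = (((PageHeader) page)->pd_flags & PD_HAS_CHECKSUM) != 0;
    bool    valid = flagged && checksum_matches(page, blkno);

    if (relation_fully_checksummed(rel))    /* hypothetical state bit */
        return valid;   /* after the sweep, anything else is garbage */

    /*
     * Mid-conversion, an unflagged page might be untouched old data or
     * a page whose header bits were corrupted away--we can't tell, so
     * only a flagged-but-mismatching page is definite corruption.
     */
    return !flagged || valid;
}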

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
