Re: 16-bit page checksums for 9.2

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-19 23:57:20
Message-ID: CA+U5nMLu3po=_5HiZCVWqgMNG84Nw=tB9VBd56_eoQQLK+UsEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>> As explained in detailed comments, the purpose of this is to implement
>> Heikki's suggestion that we have a bit set to zero so we can detect
>> failures that cause a run of 1s.
>
> I think it's nonsensical to pretend that there's anything special
> about that particular bit.  If we want to validate the page header
> before trusting the lack of a checksum bit, we can do that far more
> thoroughly than just checking that one bit.  There are a whole bunch
> of bits that ought to always be zero, and there are other things we
> can validate as well (e.g. LSN not in future).  If we're concerned
> about the checksum-enabled bit getting flipped (and I agree that we
> should be), we can check some combination of that stuff in the hope of
> catching it, and that'll be a far better guard than just checking one
> arbitrarily selected bit.

I thought it was a reasonable and practical idea from Heikki. The bit
is not selected arbitrarily, it is by design adjacent to one of the
other bits. So overall, 3 bits need to be set to a precise value and a
run of 1s or 0s will throw and error.

> That having been said, I don't feel very good about the idea of
> relying on the contents of the page to tell us whether or not the page
> has a checksum.  There's no guarantee that an error that flips the
> has-checksum bit will flip any other bit on the page, or that it won't
> flip everything else we're relying on as a backstop in exactly the
> manner that foils whatever algorithm we put in place.  Random
> corruption is, perhaps, unlikely to do that, but somehow I feel like
> it'll happen more often than random chance suggests.  Things never
> fail the way you want them to.

You're right. This patch is not the best possible world, given a clean
slate. But we don't have a clean slate.

The fact is this patch will detect corruptions pretty well and that's
what Postgres users want.

While developing this, many obstacles could have been blockers. I
think we're fairly lucky that I managed to find a way through the
minefield of obstacles.

> Another disadvantage of the current scheme is that there's no
> particularly easy way to know that your whole cluster has checksums.
> No matter how we implement checksums, you'll have to rewrite every
> table in the cluster in order to get them fully turned on.  But with
> the current design, there's no easy way to know how much of the
> cluster is actually checksummed.

You can read every block and check.

> If you shut checksums off, they'll
> linger until those pages are rewritten, and there's no easy way to
> find the relations from which they need to be removed, either.

We can't have it both ways. Either we have an easy upgrade, or we
don't. I'm told that an easy upgrade is essential.

> I'm tempted to suggest a relation-level switch: when you want
> checksums, you use ALTER TABLE to turn them on, and when you don't
> want them any more you use ALTER TABLE to shut them off again, in each
> case rewriting the table.  That way, there's never any ambiguity about
> what's in the data pages in a given relation: either they're either
> all checksummed, or none of them are.  This moves the decision about
> whether checksums are enabled or disabled quite a but further away
> from the data itself, and also allows you to determine (by catalog
> inspection) which parts of the cluster do in fact have checksums.  It
> might be kind of a pain to implement, though: you'd have to pass the
> information about how any given relation was configured down to the
> place where we validate page sanity.  I'm not sure whether that's
> practical.

It's not practical as the only mechanism, given the requirement for upgrade.

As I mention in the docs, if you want that, use VACUUM FULL. So there
is a mechanism if you want it.

> I also think that the question of where exactly the checksum ought to
> be put might bear some more thought.  Tom expressed some concern about
> stealing the page version field, and it seems to me we could work
> around that by instead stealing something less valuable.  The top
> eight bits of the pd_pagesize_version field probably meet that
> criteria, since in practice basically everybody uses 8K blocks, and
> the number of errors that will be caught by checksums is probably much
> larger than the number of errors that will be caught by the page-size
> cross-check.  But maybe the other 8 bits should come from somewhere
> else, maybe pd_tli or pd_flags.  For almost all practical purposes,
> storing only the low-order 8 bits of the TLI ought to provide just as
> much of a safety check as storing the low-order 16 bits, so I think
> that might be the way to go.  We could set it up so that we always
> check only the low 8 bits against the TLI, regardless of whether
> checksums are enabled; then we basically free up that bit space at no
> cost in code complexity.

The version stuff is catered for by the current patch.

TLI isn't something I want to mess with. It does actually mean
something, unlike the page version field which carries no appreciable
information.

People want checksums. This is the best way I've come up with of
giving it to them, given all of the constraints and requests imposed
upon me. I have no doubt that a later release will redesign the page
format and do a better job, while at the same time allowing an online
upgrade mechanism to allow block changes. But that is at least 1 year
away and experience shows that is often much longer than that. I'll
buy you a $DRINK if that happens in the next release.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jan Urbański 2012-02-20 00:09:04 Re: Potential reference miscounts and segfaults in plpython.c
Previous Message Tom Lane 2012-02-19 23:42:03 Re: Future of our regular expression code