Re: Checksums, state of play

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checksums, state of play
Date: 2012-03-06 14:25:17
Message-ID: CA+TgmoYaA-gaoM+pxytrekLuGDr6h+bzygPBTATnuHKfEF5G_Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 5, 2012 at 10:03 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> 1. We don't need them because there will be something better in a
> later release. I don't think anybody disagrees that a better solution
> is possible in the future; doubts have been expressed as to what will
> be required and when that is likely to happen. Opinions differ. We can
> and should do something now unless there is reason not to.

I don't think there's really a serious problem here. Everyone seems
to agree that checksums are useful, and nobody even seems terribly
unhappy with the fact that this is only a 16-bit checksum. On the
flip side I think it's quite likely that we will never wish to support
multiple checksum implementations, so for that reason I do believe
that whatever we commit first in this area needs to be something we
can live with for a very long time. It can have limitations of
course, but we should be comfortable with the on-disk format and basic
design of the feature.

> 2. Turning checksums on/off/on/off in rapid succession can cause false
> positive reports of checksum failure if crashes occur and are ignored.
> That may lead to the feature and PostgreSQL being held in disrepute.

This I do think is a problem, although not for precisely the reason
stated here. In my experience, in data corruption situations, the
first thing customers do is blame PostgreSQL: they don't believe it's
the hardware; they accuse us of having bugs in our code. Having a
checksum feature would be valuable, because, first, we'd perhaps
detect problems sooner and, second, people understand what checksums
are and that checksum failures really shouldn't happen unless the
hardware is bad. More generally, one of the purposes of checksums is
to distinguish hardware failure from other possible causes of data
corruption problems. If there are code paths where checksum failures
can happy despite the hardware being good, I think that the patch will
fail to accomplish its goal of giving us confidence that the hardware
is bad.

> This can be resolved, if desired, by having having a two-stage
> enabling process where we must issue a command that scans every block
> in the database before checksum checking begins. VACUUM is easily
> modified to the task, we just need to agree that is suitable and agree
> syntax.
> A suggestion is VACUUM ENABLE CHECKSUMS; others are possible.

If we're going to extend the syntax of VACUUM, we should use the
extensible-options syntax there - i.e. something like VACUUM
(CHECKSUMS ON). But overall I think this proposal lacks in detail.
It seems to me that there are four states that a database (or table,
if we made it more fine-grained) could be in:

1. No blocks have checksums.

2. Checksums are being added, but some blocks may not yet have them.
Thus, it's not an error for a block to have no checksum, but if a
block appears to have a checksum, it should be correct. All blocks
are written with checksums.

3. Checksums are fully enabled. Checksum failures are still an
error, and apparent lack of a checksum is now an error as well, since
it must indicate something's been corrupted. All blocks are written
with checksums.

4. Checksums are being removed, but some blocks may still have them.
Thus, it's not an error for a block to have no checksum, but any
still-remaining checksums should be correct (though possibly we ought
not to complain if they aren't, to provide a recovery path for users
who are turning checksums off because they're getting errors they
don't want). Any block that's written is written without checksums.

I think we need to be clear about how the system transitions between
these states. In the current patch, AIUI, you can effectively go from
1->2 or 4->2 by setting page_checksums=on and from 2->4 by setting
page_checksums=off, but there's no easy way to ensure that you've
reached state 3 or that you've gotten back to state 1. Some variant
of VACUUM seems like a good way of doing that, but it doesn't make
sense for example to have page_checksums=off and do VACUUM (CHECKSUMS
ON), or to have page_checksums=on and do VACUUM (CHECKSUMS OFF). I
guess we could just reject those as error cases, but it would be
nicer, I think, to have an interface with a higher degree of
orthogonality.

There's probably more than one way to do that, but my personal
preference, as previously noted, is to make this a table-level option,
rather than a GUC. Then, VACUUM (CHECKSUMS ON) can first change the
pg_class entry to indicate that checksums are enabling-in-progress
(i.e. 1->2), then scan the table, adding checksums, and then mark
checksums as fully enabled (i.e. 2->3). VACUUM (CHECKSUMS OFF) can
proceed in symmetric fashion, marking checksums as
disabling-in-progress (3->4), then scanning the table and getting rid
of them, and then marking them fully disabled (4->1). If a crash
happens in the middle somewhere, the state of the table can get left
as enabling-in-progress or disabling-in-progress, but a new VACUUM
(CHECKSUMS X) can be used to finish the process, and we always know
exactly where we're at.

> 3. Pages with checksums set need to have a version marking to show
> that they are a later version of the page layout. That version number
> needs to be extensible to many later versions. Pages of multiple
> versions need to exist within the server to allow simple upgrades and
> migration.

Tom proposed on another thread that we not bump the page version for
this, and I agree that shouldn't be necessary. We can just repurpose
some of the available bit space. I personally have come around to the
view that Tom's proposal of using pd_tli for this purpose would
sensible. As far as I can see, the only place where the page's TLI is
read in the entire source base is in the nbtree code, which does this:

PageSetTLI(leftpage, PageGetTLI(origpage));

And even in that case, nothing is checked or validated; it's just
copying the value from one page to another. That seems pretty
worthless, and certainly a lot less valuable that a checksum. There
are other things we could steal from as well - e.g. the high bits of
the page version, or pd_flags - but I think it's a lot more likely
that we'll want to add flags than that we'll want to use the TLI for
something. Previous page format changes have already stolen from the
TLI area, which used to be bigger, so evidently other people who have
made similar changes in the past found it a worthy place to steal bits
from.

> 4. Checksums that are dependent upon a bit setting on the block are
> somewhat fragile. Requests have been made to add bits in certain
> positions and also to remove them again. No set of bits seems to
> please everyone.
>
> (3) and (4) are in conflict with each other, but there is a solution.
> We mark the block with a version number, but we don't make the
> checking dependant upon the version number. We simply avoid making any
> checks until the command to scan all blocks is complete, per point
> (2). That way we need to use 1 flag bit to mark the new version and
> zero flag bits to indicate checks should happen.

I generally agree with this outline, though I think that in lieu of a
version number we could simply set a new pd_flags bit indicating that
checksums are enabled. If we haven't fully enabled checksums yet,
then the fact that this bit isn't set is not an error; but if
checksums are fully enabled, then every page must have that bit set,
and any page that doesn't is ipso facto corrupt.

> 5. The part of the page header that can be used as a checksum has been
> disputed. Using the 16 bits dedicated to a version number seems like
> the least useful consecutive 2 bytes of data in the page header. It
> can't be < 16 bits because that wouldn't be an effective checksum for
> database blocks. We might prefer 32 bits, but that would require use
> of some other parts of the page header and possibly split that into
> two parts. Splitting the checksum into 2 parts will cause the code to
> be more complex and fragile.

For the reasons stated above, I believe pd_tli is less useful than
pd_pagesize_version. I fear that if we repurpose pd_pagesize_version,
we're going to make things very difficult for people who want to write
block-inspection tools, like pg_filedump or pageinspect. Right now,
it's possible to look at that offset within the block and know for
certain what page version you're dealing with. If we repurpose it to
hold checksum data, that will no longer be possible. Unlike pd_tli,
pd_pagesize_version is validated every time we read a block.

> 6. Performance impacts. Measured to be a small regression.
>
> 7. Hint bit setting requires WAL logging. The worst case for that
> would be setting hints on newly loaded tables. Work has been done on
> other patches to remove that case. If those don't fly, this would be a
> cost paid by those that wish to take advantage of this feature.

The pgbench runs that Peter posted looked pretty good, but it's
subsequently occurred to me that pgbench isn't a very good test case
for assessing the cost of WAL-logging hint-bits, because it always
does a WAL-logged operation (an update) on every block that in which
it sets hint bits. So very little, if any, extra WAL-logging should
be generated in this case. There are probably cases other than bulk
data load where this is a much bigger problem, although fixing the
bulk data load case would certainly help a lot. At any rate, apart
from the bulk-data-load case, I'm not sure there's anything here that
would be so severe as to make me argue that we ought to reject the
patch entirely, but then again I guess it depends on the obscurity of
the scenario and the size of the penalty. At the very least, we
should try to understand what the performance characteristics are in
problem scenarios, even if we can't or don't choose to do anything
about them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-03-06 14:30:05 Re: performance-test farm
Previous Message Fujii Masao 2012-03-06 14:20:11 Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)