Re: Block-level CRC checks

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block-level CRC checks
Date: 2009-12-01 19:19:39
Message-ID: 4B156C4B.9000905@agliodbs.com
Lists: pgsql-hackers

All,

I feel strongly that we should be verifying pages on write, or at least
providing the option to do so, because hardware is simply not reliable.
And a lot of our biggest users are having issues; it seems pretty much
guaranteed that if you have more than 20 PostgreSQL servers, at least one
of them will have bad memory, bad RAID and/or a bad driver.

(And yes, InnoDB, DB2 and Oracle do provide tools to detect hardware
corruption when it happens. Oracle even provides correction tools. We
are *way* behind them in this regard.)

There are two primary conditions we are testing for:

(a) bad RAM, which happens as frequently as 8% of the time on commodity
servers, and given a sufficient amount of RAM happens 99% of the time
due to quantum effects, and
(b) bad I/O, in the form of bad drivers, bad RAID, and/or bad disks.

Our users potentially want to take two levels of action on this:

1. detect the corruption immediately when it happens, so that they can
effectively troubleshoot the cause of the corruption, and potentially
shut down the database before further corruption occurs and while they
still have clean backups.

2. make an attempt to fix the corrupted page before/immediately after it
is written.

Further, based on talking to some of these users who are having chronic
and undebuggable issues on their sets of hundreds of PostgreSQL servers,
there are some other requirements:

-- Many users would be willing to sacrifice significant performance (up
to 20%) as a start-time option in order to be "corruption-proof".
-- Even more users would only be interested in turning on the
anti-corruption options once they know they have a problem, in order to
troubleshoot it, and would then turn the corruption detection back off.

So, based on my conversations with users, what we really want is a
solution which does (1) for both (a) and (b) as a start-time option, and
significant performance overhead for this option is OK.

Now, do block-level CRCs qualify?

The problem I have with CRC checks is that they only detect bad I/O, and
are completely unable to detect data corruption due to bad memory. This
means that really we want a different solution which can detect both bad
RAM and bad I/O, and should only fall back on CRC checks if we're unable
to devise one.
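
To make that distinction concrete, here is a minimal sketch in Python (not
PostgreSQL code). The helpers write_page, verify_page and flip_bit are
hypothetical, and zlib.crc32 merely stands in for whatever checksum would
actually be chosen. It assumes the CRC is computed from the in-memory buffer
at write time, so a bit that flips in RAM before that point gets checksummed
along with everything else.

```python
import zlib

PAGE_SIZE = 8192  # PostgreSQL's default block size


def flip_bit(page, bit):
    """Return a copy of `page` with a single bit inverted."""
    buf = bytearray(page)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


def write_page(page):
    """Hypothetical write path: checksum the buffer as it is right now."""
    return page, zlib.crc32(page)


def verify_page(page, crc):
    """Hypothetical read path: recompute the CRC and compare."""
    return zlib.crc32(page) == crc


good = bytes(PAGE_SIZE)

# Case (b), bad I/O: the page is damaged after the CRC was computed.
stored, crc = write_page(good)
io_detected = not verify_page(flip_bit(stored, 12345), crc)

# Case (a), bad RAM: the bit flips in memory before the CRC is computed,
# so the CRC faithfully covers the already-corrupted page.
stored, crc = write_page(flip_bit(good, 12345))
ram_detected = not verify_page(stored, crc)

print(io_detected, ram_detected)  # True False
```

The second case is the one that worries me: the checksum is valid, the page
is wrong, and nothing downstream can tell.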

One of the things Simon and I talked about in Japan is that most of the
time, data corruption makes the data page and/or tuple unreadable. So,
checking data format for readable pages and tuples (and index nodes)
both before and after write to disk (the latter would presumably be
handled by the bgwriter and/or checkpointer) would catch a lot of kinds
of corruption before they had a chance to spread.

However, that solution would not detect subtle corruption, like
single-bit flips caused by quantum errors. Also, it would require
reading back each page as it's written to disk, which is OK for a bunch
of single-row writes, but a significant problem for bulk data loads.
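
A rough sketch of such a format check and its blind spot, again in Python.
The three-field header here (lower, upper, special offsets) is a
deliberately simplified, hypothetical layout and not PostgreSQL's real page
header; make_page, page_looks_sane and flip_bit are made-up names for
illustration only.

```python
import struct

PAGE_SIZE = 8192
# Hypothetical toy header: three little-endian 16-bit offsets
# (end of line pointers, start of tuple data, start of special space).
HEADER = struct.Struct("<HHH")


def make_page(payload):
    """Build a toy page: header, free space, then tuple data at the end."""
    upper = PAGE_SIZE - len(payload)
    header = HEADER.pack(HEADER.size, upper, PAGE_SIZE)
    return header + bytes(upper - HEADER.size) + payload


def page_looks_sane(page):
    """Structural check only: are the header offsets mutually consistent?"""
    if len(page) != PAGE_SIZE:
        return False
    lower, upper, special = HEADER.unpack_from(page)
    return HEADER.size <= lower <= upper <= special <= PAGE_SIZE


def flip_bit(page, bit):
    """Return a copy of `page` with a single bit inverted."""
    buf = bytearray(page)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


page = make_page(b"account balance: 100")

# A flipped bit in the header makes the offsets inconsistent: caught.
header_caught = not page_looks_sane(flip_bit(page, 15))

# A flipped bit in the tuple data leaves the structure intact: missed.
data_hit = flip_bit(page, (PAGE_SIZE - 1) * 8)
data_caught = not page_looks_sane(data_hit)

print(header_caught, data_caught, data_hit != page)  # True False True
```

So the structural check catches the "page is garbage" class of failures,
but a flipped bit inside otherwise valid tuple data passes straight through.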

So, what I'm saying is that I think we really want a better solution,
and am throwing this out there to see if anyone is clever enough.

--Josh Berkus
