Re: Block-level CRC checks

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block-level CRC checks
Date: 2009-12-01 16:06:26
Message-ID: 603c8f070912010806n4ee9528fsdf89665016dd5b30@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 1, 2009 at 10:35 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:
>
>> It's not hard to imagine that when a hardware glitch happens
>> causing corruption, it also causes the system to crash. Recalculating
>> the CRCs after crash would mask the corruption.
>
> They are already masked from us, so continuing to mask those errors
> would not put us in a worse position.
>
> If we are saying that 99% of page corruptions are caused at crash time
> because of torn pages on hint bits, then only WAL logging can help us
> find the 1%. I'm not convinced that is an accurate or safe assumption
> and I'd at least like to see LOG entries showing what happened.

It may or may not be true that most page corruptions happen at crash
time, but it's certainly false that they are caused at crash time
*because of torn pages on hint bits*. If only part of a block is
written to disk and the unwritten parts contain hint-bit changes,
that's not corruption. That's designed behavior. Any CRC system needs
to avoid complaining about errors when that happens because otherwise
people will think that their database is corrupted and their hardware
is faulty when in reality it is not.
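To make the false-alarm case concrete, here is a minimal sketch of a torn write that carries only a hint-bit change. Everything about the layout is an assumption for illustration: an 8192-byte page made of 512-byte sectors, a CRC-32 stored in the first four bytes and covering the rest of the page, and a single "hint bit" at an arbitrary offset in the second sector. It is not PostgreSQL's page format, and the helper names are hypothetical.

```python
import zlib

PAGE_SIZE = 8192      # assumed page size
SECTOR_SIZE = 512     # assumed atomic write unit of the disk
CRC_SIZE = 4          # hypothetical: CRC-32 kept in the first 4 bytes of the page
HINT_OFFSET = 1000    # hypothetical: a hint-bit byte that lives in sector 1


def page_crc(page):
    """CRC-32 over everything except the CRC field itself."""
    return zlib.crc32(bytes(page[CRC_SIZE:])) & 0xFFFFFFFF


def stamp(page):
    """Store the current CRC in the page header."""
    page[:CRC_SIZE] = page_crc(page).to_bytes(CRC_SIZE, "little")


def verify(page):
    """True if the stored CRC matches the page contents."""
    return int.from_bytes(page[:CRC_SIZE], "little") == page_crc(page)


def torn_write(old, new, sectors_written):
    """On-disk image if only the listed sectors of `new` reached the disk."""
    disk = bytearray(old)
    for s in sectors_written:
        lo = s * SECTOR_SIZE
        disk[lo:lo + SECTOR_SIZE] = new[lo:lo + SECTOR_SIZE]
    return disk


# The page as it sits on disk before the update.
old = bytearray(PAGE_SIZE)
stamp(old)

# In memory, a hint bit gets set and the CRC is recomputed.
new = bytearray(old)
new[HINT_OFFSET] |= 0x01
stamp(new)

# Crash mid-write: sector 0 (new CRC) made it out, sector 1 (hint bit) did not.
disk = torn_write(old, new, sectors_written=[0])

old_ok = verify(old)    # the old image is consistent
new_ok = verify(new)    # the new image is consistent
torn_ok = verify(disk)  # the torn image is not, although no user data was lost
```

The torn image fails verification even though the only thing missing is a hint bit that can be recomputed, which is exactly the report a CRC system has to avoid making.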

If we could find a way to put the hint bits in the same 512-byte block
as the CRC, that might do it, but I'm not sure whether that is
possible.
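A toy sketch of why that would help, under two loud assumptions: that a 512-byte sector is written atomically, and that the hint bits can live in the same sector as the CRC. The layout below (four sectors per page, CRC in bytes 0-3, a hint byte right after it) is hypothetical and is not how PostgreSQL stores hint bits today, which is the part I'm unsure can be arranged.

```python
import itertools
import zlib

SECTOR = 512             # assumed atomic write unit
SECTORS_PER_PAGE = 4     # toy page, kept small so every torn write can be enumerated
CRC_SIZE = 4             # hypothetical: CRC-32 in the first 4 bytes
HINT_OFFSET = CRC_SIZE   # hypothetical: hint byte in the same sector as the CRC


def stamp(page):
    """Store a CRC-32 of everything after the CRC field."""
    crc = zlib.crc32(bytes(page[CRC_SIZE:])) & 0xFFFFFFFF
    page[:CRC_SIZE] = crc.to_bytes(CRC_SIZE, "little")


def verify(page):
    """True if the stored CRC matches the page contents."""
    crc = zlib.crc32(bytes(page[CRC_SIZE:])) & 0xFFFFFFFF
    return int.from_bytes(page[:CRC_SIZE], "little") == crc


# Old on-disk image: sector 0 holds CRC + hint byte, the rest is filler "tuple data".
old = bytearray(SECTOR * SECTORS_PER_PAGE)
old[SECTOR:] = bytes(range(256)) * ((SECTORS_PER_PAGE - 1) * SECTOR // 256)
stamp(old)

# Hint-bit-only update: the change and the new CRC both land in sector 0.
new = bytearray(old)
new[HINT_OFFSET] |= 0x01
stamp(new)

# Try every possible subset of sectors reaching the disk before a crash.
results = []
for r in range(SECTORS_PER_PAGE + 1):
    for written in itertools.combinations(range(SECTORS_PER_PAGE), r):
        disk = bytearray(old)
        for s in written:
            disk[s * SECTOR:(s + 1) * SECTOR] = new[s * SECTOR:(s + 1) * SECTOR]
        results.append(verify(disk))

all_ok = all(results)  # every torn image is either the old page or the new page
```

Because the two images differ only inside sector 0, any partial write leaves either the complete old page or the complete new page on disk, so the check never fires on a hint-bit-only tear. It says nothing about updates that touch other sectors.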

Ignoring CRC errors after a crash until we've re-CRC'd the entire
database will certainly eliminate the bogus error reports, but it
seems likely to mask a large percentage of legitimate errors. For
example, suppose that I write 1MB of data out to disk and then don't
access it for a year. During that time the data is corrupted. Then
the system crashes. Upon recovery, since there's no way of knowing
whether hint bits on those pages were being updated at the time of the
crash, the system re-CRC's the corrupted data and declares it known
good. Six months later, I try to access the data and find out that
it's bad. Sucks to be me.
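The same scenario as a small sketch. The "recovery" step here is simply recomputing the checksum over whatever is on disk, which is my reading of the re-CRC proposal rather than anyone's actual patch; the data and variable names are made up.

```python
import zlib


def crc(data):
    """CRC-32 of a block of bytes."""
    return zlib.crc32(data) & 0xFFFFFFFF


# Data written once, with its CRC recorded at write time.
page = b"row data written a year ago and never touched since"
stored_crc = crc(page)

# Some time later a single bit flips on disk.
damaged = bytearray(page)
damaged[3] ^= 0x40
damaged = bytes(damaged)

# Before any crash, a read would catch the damage.
detected_before_crash = crc(damaged) != stored_crc

# Crash recovery cannot tell which pages had hint-bit writes in flight,
# so it re-CRCs what it finds on disk, damaged bytes included.
stored_crc = crc(damaged)

# From now on the damaged page checks out as good.
detected_after_recovery = crc(damaged) != stored_crc
```

Before the crash the mismatch is detectable; after the re-CRC pass the bad data carries a checksum that matches it, so nothing will ever flag it.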

Now consider the following alternative scenario: I write the block to
disk. Five minutes later, without an intervening crash, I read it
back in and it's bad. Yeah, the system detects it.

Which is more likely? I'm not an expert on disk failure modes, but my
intuition is that the first one will happen often enough to make us
look silly. Is it 10%? 20%? 50%? I don't know. But ISTM that a
CRC system that has no ability to determine whether a system is still
"ok" post-crash is not a compelling proposition, even though it might
still be able to detect some problems.

...Robert
