Re: Block-level CRC checks

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Josh Berkus" <josh(at)agliodbs(dot)com>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block-level CRC checks
Date: 2009-12-01 19:39:43
Message-ID: 4B151C9F020000250002CEA1@gw.wicourts.gov
Lists: pgsql-hackers
Josh Berkus <josh(at)agliodbs(dot)com> wrote:
 
> And a lot of our biggest users are having issues; it seems pretty
> much guaranteed that if you have more than 20 postgres servers, at
> least one of them will have bad memory, bad RAID and/or a bad
> driver.
 
Huh?!?  We have about 200 clusters running on about 100 boxes, and
we see that very rarely.  On about 100 older boxes, relegated to
less critical tasks, we see a failure maybe three or four times per
year.  It's usually not subtle, and a sane backup and redundant
server policy has kept us from suffering much pain from these.  I'm
not questioning the value of adding features to detect corruption,
but your numbers are hard to believe.
 
> The problem I have with CRC checks is that they only detect bad
> I/O, and are completely unable to detect data corruption due to bad
> memory. This means that really we want a different solution which
> can detect both bad RAM and bad I/O, and should only fall back on
> CRC checks if we're unable to devise one.
 
md5sum of each tuple?  As an optional system column (a la oid)?
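 
To make that concrete, here is a minimal sketch of the idea in Python. It
is not PostgreSQL code: the helper names (tuple_digest, store, verify) and
the field encoding are assumptions made only for illustration. The digest
is computed when the row is formed and kept next to it, so a later
mismatch would point to corruption somewhere between that moment and the
check, whether it happened in RAM or during I/O.

```python
import hashlib


def tuple_digest(values):
    """Return an MD5 hex digest over a row's column values (hypothetical helper)."""
    h = hashlib.md5()
    for v in values:
        # Tag NULLs separately and length-prefix each field so that
        # ("ab", "c") and ("a", "bc") do not hash to the same value.
        marker = b"N" if v is None else b"V"
        data = b"" if v is None else str(v).encode("utf-8")
        h.update(marker + len(data).to_bytes(4, "big") + data)
    return h.hexdigest()


def store(values):
    """Pair the row with its digest, like an optional extra system column."""
    row = tuple(values)
    return (row, tuple_digest(row))


def verify(stored):
    """Recompute the digest and compare it with the one stored with the row."""
    row, digest = stored
    return tuple_digest(row) == digest
```

A check would then be something like verify(store((1, "alice", None))),
which should hold until any field or the stored digest is altered. The
cost is one hash computation per tuple on write and another on every
verification, plus 16 bytes per row if the raw digest is stored.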
 
> checking data format for readable pages and tuples (and index
> nodes) both before and after write to disk
 
Given that PostgreSQL goes through the OS, and many of us are using
RAID controllers with BBU RAM, how do you do a read with any
confidence that it came from the disk?  (I mean, I know how to do
that for a performance test, but as a routine step during production
use?)
 
-Kevin
