Skip site navigation (1) Skip section navigation (2)

Re: Block-level CRC checks

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block-level CRC checks
Date: 2009-12-01 11:35:42
Message-ID: 200912011135.nB1BZgs15378@momjian.us (view raw or flat)
Thread:
Lists: pgsql-hackers
Simon Riggs wrote:
> The way we handle torn page corruptions *hides* actual corruptions from
> us. The frequency of true positives and false positives is important
> here. If the false positive ratio is very small, then reporting them is
> not a problem because of the benefit we get from having spotted the true
> positives. Some convicted murderers didn't do it, but that is not an
> argument for letting them all go free (without knowing the details). So
> we need to know what the false positive ratio is before we evaluate the
> benefit of either reporting or non-reporting possible corruption events.
> 
> When do you think torn pages happen? Only at crash, or other times also?
> Do they always happen at crash? Are there ways to re-check a block that
> has suffered a hint-related torn page issue? Are there ways to isolate
> and minimise the reporting of false positives? Those are important
> questions and this is not black and white.
> 
> If the *only* answer really is we-must-WAL-log everything, then that is
> the answer, as an option. I suspect that there is a less strict
> possibility, if we question our assumptions and look at the frequencies.
> 
> We know that I have no time to work on this; I am just trying to hold
> open the door to a few possibilities that we have not fully considered
> in a balanced way. And I myself am guilty of having slammed the door
> previously. I encourage development of a way forward based upon a
> balance of utility.

I think the problem boils down to what the user response should be to a
corruption report.  If it is a torn page, it would be corrected and the
user doesn't have to do anything.  If it is something that is not
correctable, then the user has corruption and/or bad hardware. I think
the problem is that the existing proposal can't distinguish between
these two cases so the user has no idea how to respond to the report.

-- 
  Bruce Momjian  <bruce(at)momjian(dot)us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

In response to

Responses

pgsql-hackers by date

Next:From: Robert HaasDate: 2009-12-01 11:50:57
Subject: Re: CommitFest status/management
Previous:From: Tsutomu YamadaDate: 2009-12-01 11:25:56
Subject: [PATCH] Windows x64

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group