Re: Block-level CRC checks

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block-level CRC checks
Date: 2009-12-01 11:35:42
Message-ID: 200912011135.nB1BZgs15378@momjian.us
Lists: pgsql-hackers

Simon Riggs wrote:
> The way we handle torn page corruptions *hides* actual corruptions from
> us. The frequency of true positives and false positives is important
> here. If the false positive ratio is very small, then reporting them is
> not a problem because of the benefit we get from having spotted the true
> positives. Some convicted murderers didn't do it, but that is not an
> argument for letting them all go free (without knowing the details). So
> we need to know what the false positive ratio is before we evaluate the
> benefit of either reporting or non-reporting possible corruption events.
>
> When do you think torn pages happen? Only at crash, or other times also?
> Do they always happen at crash? Are there ways to re-check a block that
> has suffered a hint-related torn page issue? Are there ways to isolate
> and minimise the reporting of false positives? Those are important
> questions and this is not black and white.
>
> If the *only* answer really is we-must-WAL-log everything, then that is
> the answer, as an option. I suspect that there is a less strict
> possibility, if we question our assumptions and look at the frequencies.
>
> We know that I have no time to work on this; I am just trying to hold
> open the door to a few possibilities that we have not fully considered
> in a balanced way. And I myself am guilty of having slammed the door
> previously. I encourage development of a way forward based upon a
> balance of utility.

I think the problem boils down to what the user response should be to a
corruption report. If it is a torn page, it will be corrected and the
user doesn't have to do anything. If it is something that is not
correctable, then the user has corruption and/or bad hardware. I think
the problem is that the existing proposal can't distinguish between
these two cases, so the user has no idea how to respond to the report.
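
To make that concrete, here is a rough Python sketch, not the proposed
patch. The page layout (a CRC-32 in the first 4 bytes), the 512-byte
sector size, and the helper names stamp(), verify() and torn_write() are
all assumptions made up for illustration. It shows a torn write after an
unlogged hint-bit change and a genuine bit flip both producing the same
bare "CRC mismatch":

```python
import zlib

PAGE_SIZE = 8192     # assumed page size
SECTOR_SIZE = 512    # assumed unit the disk writes atomically
CRC_BYTES = 4        # hypothetical layout: CRC-32 stored at the start of the page


def stamp(body: bytes) -> bytes:
    """Build a page whose first 4 bytes hold the CRC-32 of the rest."""
    assert len(body) == PAGE_SIZE - CRC_BYTES
    return zlib.crc32(body).to_bytes(CRC_BYTES, "little") + body


def verify(page: bytes) -> bool:
    """Return True if the stored CRC matches the page contents."""
    stored = int.from_bytes(page[:CRC_BYTES], "little")
    return stored == zlib.crc32(page[CRC_BYTES:])


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Simulate a crash after only the first N sectors of 'new' reached disk."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]


# A valid page already on disk.
old_body = bytes(PAGE_SIZE - CRC_BYTES)
old_page = stamp(old_body)

# Case 1: a hint bit is set in the last sector (no WAL record), the page is
# re-stamped, and the write is torn after the first sector. The new CRC is
# on disk, but the sector holding the hint bit is still the old one.
new_body = bytearray(old_body)
new_body[-10] |= 0x01
new_page = stamp(bytes(new_body))
torn_page = torn_write(old_page, new_page, sectors_written=1)

# Case 2: real corruption, one bit flipped by bad hardware.
damaged = bytearray(old_page)
damaged[4000] ^= 0x80
bad_page = bytes(damaged)

for name, page in [("old", old_page), ("new", new_page),
                   ("torn", torn_page), ("bad", bad_page)]:
    print(name, "ok" if verify(page) else "CRC mismatch")
```

The intact pages verify, and both the torn page and the damaged page
fail, with nothing in the result of verify() to tell them apart.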

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
