Re: Block-level CRC checks

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block-level CRC checks
Date: 2008-11-17 08:26:08
Message-ID: 49212AA0.9060402@enterprisedb.com
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Fri, Nov 14, 2008 at 10:51:57AM -0500, Tom Lane wrote:
>> In fact, if the patch were to break torn-page handling, it would be
>> 100% likely to be a net *decrease* in system reliability. It would add
>> detection of a situation that is not supposed to happen (ie, storage
>> system fails to return the same data it stored) at the cost of breaking
>> one's database when the storage system acts as it's expected and
>> documented to in a routine power-loss situation.
>
> Ok, I see it's a problem because the hint changes are not WAL logged,
> so torn pages are expected to work in normal operation. But simply
> skipping the hint bits during checksumming is a terrible solution,
> since then any errors in those bits will go undetected. To not be able
> to say in the documentation that you'll detect 100% of single-bit
> errors is pretty darn terrible, since that's kind of the goal of the
> exercise.

Agreed, trying to explain that in the documentation would look like
making excuses.

The requirement that all hint bit changes are WAL-logged seems like a
pretty big change. I don't like doing that, just for CRCing.

There has been discussion before about not writing out pages to disk
that only have hint-bit updates on them. That means that the next time
the page is read, the reader needs to do the clog lookups and set the
hint bits again. It's a tradeoff, making the first SELECT after
modifying a page cheaper, I/O-wise, at the cost of making all subsequent
SELECTs that need to read the page from disk or kernel cache more
expensive, CPU-wise.
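
To make that tradeoff concrete, here is a toy model in Python. It is not
PostgreSQL code. It assumes every tuple on the page was written by a
committed transaction, that the page is evicted between reads so each read
starts from the on-disk image, and that one clog lookup is needed per tuple
whose hint bit is missing. The name simulate and its parameters are made up
for this illustration.

```python
def simulate(n_tuples: int, n_reads: int, write_hint_only_pages: bool):
    """Return (page_writes, clog_lookups) for repeated reads of one page.

    Hypothetical model: the on-disk page starts with no hint bits set, and
    every read comes from disk (or kernel cache), not from shared buffers.
    """
    disk_hints = [False] * n_tuples
    page_writes = 0
    clog_lookups = 0

    for _ in range(n_reads):
        # The reader must consult clog for every tuple lacking a hint bit.
        missing = disk_hints.count(False)
        clog_lookups += missing

        # Setting hint bits dirties the page. It only reaches disk if
        # hint-bit-only changes are allowed to trigger a write.
        if missing and write_hint_only_pages:
            disk_hints = [True] * n_tuples
            page_writes += 1

    return page_writes, clog_lookups


if __name__ == "__main__":
    for flag in (True, False):
        writes, lookups = simulate(n_tuples=100, n_reads=5,
                                   write_hint_only_pages=flag)
        print(f"write_hint_only_pages={flag}: "
              f"writes={writes} clog_lookups={lookups}")
```

With hint-only writes enabled the model pays one extra page write and then
no further clog lookups. With them disabled it never writes the page but
repeats the lookups on every read.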

I'm not sure if I like that idea or not, but it would also solve the CRC
problem with torn pages. FWIW, it would also solve the problem suggested
with IBM DTLA disks and others that might zero-out a sector in case of
an interrupted write. I'm not totally convinced that's a problem, as
there's apparently other software that makes the same assumption as we
do, and we haven't heard of any torn-page corruption in real life, but
still.

If we made the behavior configurable, that would be pretty hard to
explain in the docs. We'd have three options with dependencies:

- CRC on/off
- write pages with only hint bit changes on/off
- full_page_writes on/off

If you disable full_page_writes, you're vulnerable to torn pages. If you
enable it, you're not. Except if you also turn CRC on. Except if you
also turn "write pages with only hint bit changes" off.
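
The chain of exceptions above can be written out as a small truth table.
This is only a sketch of the rule as stated in this mail. Only
full_page_writes is an existing setting; the parameter names crc and
write_hint_only_pages stand for the two hypothetical options in the list
above.

```python
from itertools import product


def torn_page_safe(crc: bool, write_hint_only_pages: bool,
                   full_page_writes: bool) -> bool:
    """Encode the rule from the paragraph above, nothing more."""
    if not full_page_writes:
        # Without full-page images in WAL, a torn page cannot be repaired.
        return False
    if not crc:
        # Torn hint-bit-only writes are harmless when nothing checks them.
        return True
    # With CRC on, a torn hint-bit-only write would fail the check,
    # unless such pages are never written out in the first place.
    return not write_hint_only_pages


if __name__ == "__main__":
    print("crc    hint_writes  full_page_writes  safe")
    for crc, hint, fpw in product([False, True], repeat=3):
        safe = torn_page_safe(crc, hint, fpw)
        print(f"{crc!s:<6} {hint!s:<12} {fpw!s:<17} {safe}")
```

Only three of the eight combinations come out safe, which is the part that
would be hard to explain in the docs.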

> Unfortunately, there's not a lot of easy solutions here. You could do
> two checksums, one with and one without hint bits. The overall checksum
> tells you if there's a problem. If it doesn't match, the second checksum
> will tell you if it's the hint bits or not (torn page problem). If it's
> the hint bits you can reset them all and continue. The checksums need
> not be of equal strength.

Hmm, that would work I guess.
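
A minimal sketch of the two-checksum idea, in Python rather than backend C.
It assumes a page is a byte string and that a same-length mask marks which
bits are hint bits. The helper names mask_hints, checksums and verify are
invented for this example, and zlib.crc32 stands in for whatever checksum
would really be used for both the strong and the weak check.

```python
import zlib


def mask_hints(page: bytes, hint_mask: bytes) -> bytes:
    """Return a copy of the page with every hint bit cleared."""
    return bytes(b & ~m & 0xFF for b, m in zip(page, hint_mask))


def checksums(page: bytes, hint_mask: bytes):
    """Compute (checksum over everything, checksum ignoring hint bits)."""
    return zlib.crc32(page), zlib.crc32(mask_hints(page, hint_mask))


def verify(page: bytes, hint_mask: bytes, full_crc: int, masked_crc: int):
    """Classify a page read back from disk.

    Returns ("ok", page), ("hints_reset", page_with_hints_cleared),
    or ("corrupt", page).
    """
    if zlib.crc32(page) == full_crc:
        return "ok", page
    cleared = mask_hints(page, hint_mask)
    if zlib.crc32(cleared) == masked_crc:
        # Only hint bits differ, e.g. a torn hint-bit-only write.
        # Reset them all; readers will set them again from clog.
        return "hints_reset", cleared
    return "corrupt", page


if __name__ == "__main__":
    page = bytes([0x10, 0x03, 0x20, 0x01])
    hint_mask = bytes([0x00, 0x0F, 0x00, 0x0F])  # low nibbles are hints
    full_crc, masked_crc = checksums(page, hint_mask)

    torn = bytes([0x10, 0x07, 0x20, 0x01])       # one hint bit changed
    damaged = bytes([0x11, 0x03, 0x20, 0x01])    # one data bit changed
    for name, image in (("intact", page), ("torn", torn),
                        ("damaged", damaged)):
        print(name, verify(image, hint_mask, full_crc, masked_crc)[0])
```

Note that a genuine bit flip inside a hint bit is classified the same way as
a torn hint-bit write, so it gets repaired by resetting the hints rather than
reported.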

> The extreme case is an ECC where you explicitly can set it so you can
> alter N bits before you need to recalculate the checksum.
> Computationally though, that sucks.

Yep. Also, in case of a torn page, you're very likely going to have
several hint bits from the old image and several from the new image. An
error-correcting code would need to be unfeasibly long to cope with that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
