Re: Block-level CRC checks

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block-level CRC checks
Date: 2009-12-01 19:10:07
Message-ID: 407d949e0912011110h5b0b126br74eb7d3efc337c63@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 1, 2009 at 6:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>> OK, here is another idea, maybe crazy:
>
>> When we read in a page that has an invalid CRC, we check the page to see
>> which hint bits are _not_ set, and we try setting them to see if we can get
>> a matching CRC.

Unfortunately, you would also have to try *unsetting* every hint bit as
well, since the updated hint bits might have made it to disk but not
the CRC, leaving the old CRC for the block with the unset bits.

I actually independently had the same thought today that Simon had of
moving the hint bits to the line pointer. We can obtain more free bits
in the line pointers by dividing the item offsets and sizes by
maxalign if we need it. That should give at least 4 spare bits which
is all we need for the four VALID/INVALID hint bits.
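
To make the bit counting concrete, here is a rough sketch of such a
packed line pointer. It assumes the current 32-bit layout of a 15-bit
offset, 2 flag bits and a 15-bit length, a worst-case maxalign of 4,
and lengths rounded up to a multiple of maxalign. All of the names
(SketchItemId, sketch_pack and friends) are made up for illustration
and are not existing PostgreSQL code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: worst-case maxalign of 4, so each 15-bit field needs 13 bits. */
#define SKETCH_MAXALIGN   4u
#define SKETCH_UNIT_BITS  13u
#define SKETCH_UNIT_MASK  ((1u << SKETCH_UNIT_BITS) - 1u)

/*
 * Hypothetical layout of one 32-bit line pointer:
 *   bits  0-12  offset in maxalign units
 *   bits 13-14  the existing two flag bits
 *   bits 15-27  length in maxalign units
 *   bits 28-31  the four spare bits, used here for hint bits
 */
typedef uint32_t SketchItemId;

static SketchItemId
sketch_pack(uint32_t off, uint32_t flags, uint32_t len, uint32_t hints)
{
    /* Offsets and lengths must already be maxaligned to divide cleanly. */
    assert(off % SKETCH_MAXALIGN == 0 && len % SKETCH_MAXALIGN == 0);
    assert(off / SKETCH_MAXALIGN <= SKETCH_UNIT_MASK);
    assert(len / SKETCH_MAXALIGN <= SKETCH_UNIT_MASK);
    assert(flags <= 3u && hints <= 15u);

    return (off / SKETCH_MAXALIGN)
        | (flags << 13)
        | ((len / SKETCH_MAXALIGN) << 15)
        | (hints << 28);
}

static uint32_t sketch_off(SketchItemId id)
{
    return (id & SKETCH_UNIT_MASK) * SKETCH_MAXALIGN;
}

static uint32_t sketch_flags(SketchItemId id)
{
    return (id >> 13) & 3u;
}

static uint32_t sketch_len(SketchItemId id)
{
    return ((id >> 15) & SKETCH_UNIT_MASK) * SKETCH_MAXALIGN;
}

static uint32_t sketch_hints(SketchItemId id)
{
    return id >> 28;
}

/* Replace the hint bits without disturbing offset, flags or length. */
static SketchItemId sketch_set_hints(SketchItemId id, uint32_t hints)
{
    assert(hints <= 15u);
    return (id & 0x0FFFFFFFu) | (hints << 28);
}
```

With a maxalign of 8 each field would need only 12 bits, which would
leave 6 spare bits instead of 4.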

It should be relatively cheap to skip the hint bits in the line
pointers since they'll be the same bits of every 16-bit value for a
whole range. Alternatively we could just CRC the tuples and assume a
corrupted line pointer will show itself quickly. That would actually
make it faster than a straight CRC of the whole block -- making
lemonade out of lemons as it were.
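
A rough sketch of the first option, skipping the hint bits while
checksumming the line pointer array. It assumes, purely for
illustration, that the hint bits occupy the top 4 bits of each 32-bit
line pointer, and it uses a plain bitwise CRC-32 for brevity rather
than whatever CRC we would actually pick. The names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: hint bits live in the top 4 bits of every line pointer. */
#define SKETCH_HINT_MASK 0xF0000000u

/* Plain bitwise (reflected) CRC-32 step, polynomial 0xEDB88320. */
static uint32_t
sketch_crc32_byte(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}

/*
 * CRC over an array of line pointers with the hint bits forced to zero,
 * so that setting or clearing a hint bit never changes the result.
 * Bytes are fed low-order first to keep the result independent of the
 * host byte order.
 */
static uint32_t
sketch_crc_line_pointers(const uint32_t *lp, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < n; i++)
    {
        uint32_t v = lp[i] & ~SKETCH_HINT_MASK;   /* same mask every time */

        for (int b = 0; b < 4; b++)
            crc = sketch_crc32_byte(crc, (uint8_t) (v >> (8 * b)));
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Because the mask is the same constant for every entry, the extra cost
over a straight CRC of the array is one AND per line pointer.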

There's still the all-tuples-in-page-are-visible hint bit and the hint
bits in btree pages. I'm not sure if those are easier or harder to
solve. We might be able to assume the all-visible flag will not be
torn from the CRC as long as they're within the same 512 byte sector.
And iirc the btree hint bits are in the line pointers themselves as
well?

Another thought is that we could use the MSSQL-style torn page
detection of including a counter (or even a bit?) in every 512-byte
chunk which gets incremented every time the page is written. If they
don't all match when read in then the page was torn and we can't check
the CRC. That gets us the advantage that we can inform the user that a
torn page was detected so they know that they must absolutely use
full_page_writes on their system. Currently users are in the dark about
whether their system is susceptible to torn pages, and have no idea
how often they occur. Even here there are quite divergent opinions
about their frequency and which systems are susceptible to them or
immune.
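
A rough sketch of the counter scheme, assuming an 8192-byte page, 512-byte
sectors, and a simplification in which the last byte of every sector
is simply reserved for the counter. I haven't checked how MSSQL
actually lays this out, and every name below is made up for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE    8192
#define SKETCH_SECTOR_SIZE  512
#define SKETCH_SECTORS      (SKETCH_PAGE_SIZE / SKETCH_SECTOR_SIZE)

/* Assumption: the last byte of each sector is reserved for the counter. */
#define SKETCH_STAMP_POS(s) ((s) * SKETCH_SECTOR_SIZE + SKETCH_SECTOR_SIZE - 1)

/* Before writing a page out, stamp the same counter into every sector. */
static void
sketch_stamp_page(uint8_t *page, uint8_t counter)
{
    for (int s = 0; s < SKETCH_SECTORS; s++)
        page[SKETCH_STAMP_POS(s)] = counter;
}

/* After reading a page in, any mismatch between stamps means it was torn. */
static bool
sketch_page_is_torn(const uint8_t *page)
{
    uint8_t first = page[SKETCH_STAMP_POS(0)];

    for (int s = 1; s < SKETCH_SECTORS; s++)
        if (page[SKETCH_STAMP_POS(s)] != first)
            return true;
    return false;
}

/* Test helper: pretend only the first nsectors of a write reached disk. */
static void
sketch_simulate_partial_write(uint8_t *disk, const uint8_t *newpage,
                              int nsectors)
{
    memcpy(disk, newpage, (size_t) nsectors * SKETCH_SECTOR_SIZE);
}
```

On read, the stamps would be checked first, and the CRC only verified
if they all agree. A one-byte counter wraps after 256 writes, which is
harmless here since a torn page only ever mixes two consecutive
versions.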

--
greg
