Re: Page Checksums

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-18 19:34:03
Message-ID: 4EEE402B.1030807@enterprisedb.com
Lists: pgsql-hackers

On 18.12.2011 20:44, David Fetter wrote:
> On Sun, Dec 18, 2011 at 12:19:32PM +0200, Heikki Linnakangas wrote:
>> On 18.12.2011 10:54, David Fetter wrote:
>>> On Sun, Dec 18, 2011 at 10:14:38AM +0200, Heikki Linnakangas wrote:
>>>> On 17.12.2011 23:33, David Fetter wrote:
>>>>> If this introduces new failure modes, please detail, and preferably
>>>>> demonstrate, just what those new modes are.
>>>>
>>>> Hint bits, torn pages -> failed CRC. See earlier discussion:
>>>>
>>>> http://archives.postgresql.org/pgsql-hackers/2009-11/msg01975.php
>>>
>>> The patch requires that full page writes be on in order to obviate
>>> this problem by never reading a torn page.
>>
>> Doesn't help. Hint bit updates are not WAL-logged.
>
> What new failure modes are you envisioning for this case?

Umm, the one explained in the email I linked to... Let me try once more.
For the sake of keeping the example short, imagine that the PostgreSQL
block size is 8 bytes, and the OS block size is 4 bytes. The CRC is 1
byte, and is stored in the first byte of each page.

In the beginning, a page is in the buffer cache, and it looks like this:

AA 12 34 56 78 9A BC DE

AA is the checksum. Now a hint bit on the last byte is set, so that the
page in the shared buffer cache looks like this:

AA 12 34 56 78 9A BC DF

Now PostgreSQL wants to evict the page from the buffer cache, so it
recalculates the CRC. The page in the buffer cache now looks like this:

BB 12 34 56 78 9A BC DF

Now, PostgreSQL writes the page to the OS cache, with the write() system
call. It sits in the OS cache for a few seconds, and then the OS decides
to flush the first 4 bytes, i.e. the first OS block, to disk. On disk,
you now have this:

BB 12 34 56 78 9A BC DE

If the server now crashes, before the OS has flushed the second half of
the PostgreSQL page to disk, you have a classic torn page. The updated
CRC made it to disk, but the hint bit did not. The CRC on disk does not
match the rest of the contents of that page on disk.

Without CRCs, that's not a problem because the data is valid whether or
not the hint bit makes it to the disk. It's just a hint, after all. But
when you have a CRC on the page, the CRC is only valid if both the CRC
update *and* the hint bit update make it to disk, or neither does.

So you've just turned an innocent torn page, which PostgreSQL tolerates
just fine, into a block with a bad CRC.
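
The sequence above can be replayed as a small toy model. This is only a
sketch under stated assumptions: the 8-byte page and 4-byte OS block come
from the example, but toy_checksum() is a made-up XOR stand-in (so the
checksum values differ from the illustrative AA/BB), and none of the names
below are PostgreSQL functions.

```python
# Toy model of the example: 8-byte page, 4-byte OS blocks,
# 1-byte checksum stored in byte 0. All names here are hypothetical.

OS_BLOCK = 4


def toy_checksum(page: bytes) -> int:
    """XOR of every byte after the checksum byte (stand-in, not a real CRC)."""
    c = 0
    for b in page[1:]:
        c ^= b
    return c


def stamp(page: bytearray) -> None:
    """Recompute the checksum and store it in the first byte."""
    page[0] = toy_checksum(page)


def verify(page: bytes) -> bool:
    """True if the stored checksum matches the rest of the page."""
    return page[0] == toy_checksum(page)


# Page as it currently sits on disk, with a valid checksum.
on_disk = bytearray.fromhex("00123456789ABCDE")
stamp(on_disk)

# In the buffer cache: the hint bit in the last byte is set (DE -> DF),
# and the checksum is recalculated when the page is evicted.
in_buffer = bytearray(on_disk)
in_buffer[-1] |= 0x01
stamp(in_buffer)

# Crash after the OS flushed only the first 4-byte block:
# new checksum on disk, old second half with the hint bit still unset.
torn = bytearray(on_disk)
torn[:OS_BLOCK] = in_buffer[:OS_BLOCK]

print(verify(on_disk), verify(in_buffer), verify(torn))
```

Both complete versions of the page verify, while the torn one does not,
even though everything after its checksum byte is identical to the old,
perfectly usable page.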

> Any way to
> simulate them, even if it's by injecting faults into the source code?

Hmm, it's hard to persuade the OS to suffer a torn page on purpose. What
you could do is split the write() call in mdwrite() into two. First
write the 1st half of the page, then the second. Then you can put a
breakpoint in between the writes, and kill the system before the 2nd
half is written.
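
The same idea can be tried out in user space before touching the backend.
The following is a rough sketch, not a patch to mdwrite(): split_write()
and SimulatedCrash are hypothetical names, the exception stands in for
killing the system at the breakpoint, and the page contents are the
illustrative bytes from the example above.

```python
import os
import tempfile


class SimulatedCrash(Exception):
    """Stands in for killing the system between the two writes."""


def split_write(path, offset, page, crash_between=False):
    """Write a page as two half-page writes instead of a single write."""
    half = len(page) // 2
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(page[:half])
        f.flush()
        os.fsync(f.fileno())  # make sure the first half really went out
        if crash_between:
            raise SimulatedCrash("killed before the second half was written")
        f.write(page[half:])
        f.flush()
        os.fsync(f.fileno())


old_page = bytes.fromhex("AA123456789ABCDE")  # on disk before the write
new_page = bytes.fromhex("BB123456789ABCDF")  # new checksum + hint bit

# Create a scratch file holding the old page.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(old_page)

try:
    split_write(path, 0, new_page, crash_between=True)
except SimulatedCrash:
    pass

with open(path, "rb") as f:
    result = f.read()
os.remove(path)

print(result.hex().upper())
```

What is left in the file is the new first half followed by the old second
half, i.e. the torn page from the example.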

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
