Re: Block-level CRC checks

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql(at)mohawksoft(dot)com, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Decibel! <decibel(at)decibel(dot)org>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block-level CRC checks
Date: 2008-10-01 17:25:52
Message-ID: 871vz01b33.fsf@oxford.xeocode.com
Lists: pgsql-hackers


Aidan Van Dyk <aidan(at)highrise(dot)ca> writes:

> * Gregory Stark <stark(at)enterprisedb(dot)com> [081001 11:59]:
>
>> If setting a hint bit cleared a flag on the buffer header then the
>> checksumming process could set that flag, begin checksumming, and check that
>> the flag is still set when he's finished.
>>
>> Actually I suppose that wouldn't actually be good enough. He would have to do
>> the i/o and check that the checksum was still valid after the i/o. If not then
>> he would have to recalculate the checksum and repeat the i/o. That might make
>> the idea a loser since I think the only way it wins is if you rarely actually
>> get someone setting the hint bits during i/o anyways.
>
> A doubled-write is essentially "free" with PostgreSQL because it's not
> doing direct IO, rather relying on the OS page cache to be efficient.

All things are relative. What we're talking about here is all CPU and
memory-bandwidth cost anyway, so yes, it'll be cheap compared to the disk
i/o, but it'll still double the memory bandwidth and CPU cost of these
routines.

That said you would only have to do it in cases where the hint bits actually
get twiddled. That might not actually happen often.

> But the problem is if something crashes (or interrupts PG) between those
> two writes, you've got a block of data into the pagecache (and possibly
> to the disks) that PG will no longer read in, because the CRC/checksum
> fails despite the actual content being valid...

I don't think this is a problem because we're still doing WAL logging. The i/o
isn't allowed to happen until the page has been WAL logged and fsynced
anyways.

Incidentally I think the JUST_DIRTIED bit might actually be sufficient here.
Hint bits already cause the buffer to be marked dirty. So the only case I see
a real problem for is when we're writing a block as part of a checkpoint and
find it's JUST_DIRTIED after writing it. In that case we would have to start
over and write it again rather than leave it marked dirty.
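
To make the checkpoint case concrete, here is a minimal single-threaded sketch
of that retry loop. Everything in it is an assumption made for illustration:
SketchBuffer, the JUST_DIRTIED value, sketch_checksum and simulated_write are
hypothetical stand-ins, not the real buffer manager structures, and a real
implementation would need proper locking or atomics around the flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JUST_DIRTIED 0x01u      /* hypothetical flag bit for this sketch */

typedef struct {
    uint8_t  data[64];          /* stand-in for a page */
    uint32_t flags;
} SketchBuffer;

/* Hypothetical write callback: receives the buffer and its checksum. */
typedef void (*write_fn)(SketchBuffer *buf, uint32_t checksum, void *arg);

/* Placeholder checksum (a plain byte sum), standing in for a real CRC. */
static uint32_t sketch_checksum(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/*
 * Checkpoint-style write: clear the flag, checksum, write, and repeat
 * while something re-dirtied the buffer during the write.
 * Returns the number of writes that were needed.
 */
static int checkpoint_write(SketchBuffer *buf, write_fn do_write, void *arg)
{
    int attempts = 0;

    assert(buf != NULL && do_write != NULL);
    do {
        buf->flags &= ~JUST_DIRTIED;
        uint32_t sum = sketch_checksum(buf->data, sizeof buf->data);
        do_write(buf, sum, arg);
        attempts++;
    } while (buf->flags & JUST_DIRTIED);

    return attempts;
}

/*
 * Simulated writer: pretends a hint bit gets set during the first
 * *remaining writes, which also sets JUST_DIRTIED on the buffer.
 */
static void simulated_write(SketchBuffer *buf, uint32_t checksum, void *arg)
{
    int *remaining = arg;

    (void) checksum;            /* a real writer would store this with the page */
    if (*remaining > 0) {
        buf->data[0] |= 0x01;   /* the "hint bit" */
        buf->flags |= JUST_DIRTIED;
        (*remaining)--;
    }
}
```

With two simulated interruptions the loop performs three writes and leaves the
flag clear, which is the "rarely happens, but costs a rewrite when it does"
trade-off described above.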

If we're writing the block as part of normal i/o then we could just decide to
leave the possibly-bogus checksum in the table, since after a crash it'll be
overwritten by a full page write anyway, and in normal use it'll be
overwritten when the newly dirty buffer is eventually written out again.

If you're not doing full page writes then you would have to restore from
backup in cases where previously the page might actually have been valid
though. That's kind of unfortunate. In theory it hasn't actually changed
anything about the risks of running without full page writes, but it has
certainly increased the likelihood of actually having to deal with
"corruption" in the form of a gratuitously invalid checksum. (Of course,
without checksums you never actually know whether you have corruption,
including real corruption.)

> One possibility would be to "double-buffer" the write... i.e. as you
> calculate your CRC, you're doing it on a local copy of the block, which
> you hand to the OS to write... If you're touching the whole block of
> memory to CRC it, it isn't *ridiculously* more expensive to copy the
> memory somewhere else as you do it...

Hm. Well that might actually work. You can do the CRC at the same time as
copying to the buffer, effectively doing it for the same cost as the CRC
alone.
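
A rough sketch of what that combined pass could look like, assuming a plain
bitwise CRC-32 (reflected polynomial 0xEDB88320) purely for illustration. The
names crc32_update and copy_with_crc are made up for this example and say
nothing about which CRC or which code path would really be used. The point is
only that each source byte is loaded once and feeds both the private copy and
the checksum, so the checksum describes the copy that gets handed to the OS
rather than the shared buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 step (reflected polynomial 0xEDB88320); slow but table-free. */
static uint32_t crc32_update(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}

/*
 * Copy len bytes from src to dst and return the CRC-32 of the bytes
 * that were copied, all in a single pass over the source.
 */
static uint32_t copy_with_crc(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < len; i++) {
        uint8_t b = src[i];     /* same value goes to the copy and the CRC */
        dst[i] = b;
        crc = crc32_update(crc, b);
    }
    return crc ^ 0xFFFFFFFFu;
}
```

A production version would presumably use a table-driven or hardware CRC and
copy in wider units, but the memory traffic argument is the same.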

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
