Re: Page Checksums

From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-19 04:44:43
Message-ID: CAC_2qU8nWNrOZ7L-A9ATHNTiARX1wXq3RbnL1NCXg6+6VT7Aqw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Dec 18, 2011 at 11:21 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 12/18/11 5:55 PM, Greg Stark wrote:
>> There is another way to look at this problem. Perhaps it's worth
>> having a checksum *even if* there are ways for the checksum to be
>> spuriously wrong. Obviously having an invalid checksum can't be a
>> fatal error then but it might still be useful information. Rright now
>> people don't really know if their system can experience torn pages or
>> not and having some way of detecting them could be useful. And if you
>> have other unexplained symptoms then having checksum errors might be
>> enough evidence that the investigation should start with the hardware
>> and get the sysadmin looking at hardware logs and running memtest
>> sooner.
>
> Frankly, if I had torn pages, even if it was just hint bits missing, I
> would want that to be logged.  That's expected if you crash, but if you
> start seeing bad CRC warnings when you haven't had a crash?  That means
> you have a HW problem.
>
> As long as the CRC checks are by default warnings, then I don't see a
> problem with this; it's certainly better than what we have now.

But the scary part is you don't know how long *ago* the crash was.
Because a hint-bit-only change w/ a torn-page is a "non event" in
PostgreSQL *DESIGN*, on crash recovery, it doesn't do anything to try
and "scrub" every page in the database.

So you could have a crash, then a recovery, and a couple clean
shutdown-restart combinations before you happen to read the "needed"
page that was torn in the crash $X [ days | weeks | months ] ago.
It's specifically because PostgreSQL was *DESIGNED* to make torn pages
a non-event (because WAL/FPW fixes anything that's dangerous), that
the whole CRC issue is so complicated...

I'll through out a few random thoughts (some repeated) that people who
really want the CRC can fight over:

1) Find a way to not bother writing out hint-bit-only-dirty pages....
I know people like Kevin keep recommending a vacuum freeze after a
big load to avoid later problems anyways and I think that's probably
common in big OLAP shops, and OLTP people are likely to have real
changes on the page anyways. Does anybody want to try and measure
what type of performance trade-offs we'ld really have on a variety of
"normal" (ya, I know, what's normal) workloads? If the page has a
real change, it's got a WAL FPW, so we avoid the problem....

2) If the writer/checksummer knows it's a hint-bit-only-dirty page,
can it stuff a "cookie" checksum in it and not bother verifying?
Looses a bit of the CRC guarentee, especially around "crashes" which
is when we expect a torn page, but avoids the whole "scary! scary!
Your database is corrupt!" false-positives in the situation PostgreSQL
was specifically desinged to make not scary.

#) Anybody investigated putting the CRC in a relation fork, but not
right in the data block? If the CRC contains a timestamp, and is WAL
logged before the write, at least on reading a block with a wrong
checksum, if a warning is emitted, the timestamp could be looked at by
whoever is reading the warning and know tht the block was written
shortly before the crash $X $PERIODS ago....

The whole "CRC is only a warning" because we "expect to get them if we
ever crashed" means that the time when we most want them, we have to
assume they are bogus... And to make matters worse, we don't even
know when the perioud of "they may be bugus" ends, unless we have a
way to methodically force PG through ever buffer in the database after
the crash... And then that makes them very hard to consider
useful...

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Emanuel Calvo 2011-12-19 09:02:11 Re: REMINDER: FOSDEM 2012 - PostgreSQL Devroom: Call for Speakers
Previous Message Merlin Moncure 2011-12-19 04:41:54 Re: JSON for PG 9.2