Re: Page Checksums

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <greg(at)2ndQuadrant(dot)com>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-19 23:14:16
Message-ID: 4EEF70E80200002500043E37@gw.wicourts.gov
Lists: pgsql-hackers

Greg Smith <greg(at)2ndQuadrant(dot)com> wrote:

> But if you need all that infrastructure just to get the feature
> launched, that's a bit hard to stomach.

Triggering a vacuum or some hypothetical "scrubbing" feature?

> Also, as someone who follows Murphy's Law as my chosen religion,

If you don't think I pay attention to Murphy's Law, I should recap
our backup procedures -- which involve three separate forms of
backup, each to multiple servers in different buildings, real-time,
plus idle-time comparison of the databases of origin to all replicas
with reporting of any discrepancies. And off-line "snapshot"
backups on disk at a records center controlled by a different
department. That's in addition to RAID redundancy and hardware
health and performance monitoring. Some people think I border on
the paranoid on this issue.

> I would expect this situation could be exactly how flaky hardware
> would first manifest itself: server crash and a bad CRC on the
> last thing written out. And in that case, the last thing you want
> to do is assume things are fine, then kick off a VACUUM that might
> overwrite more good data with bad.

Are you arguing that autovacuum should be disabled after crash
recovery? I guess if you are arguing that a database VACUUM might
destroy recoverable data when hardware starts to fail, I can't
argue. And certainly there are way too many people who don't ensure
that they have a good backup before firing up PostgreSQL after a
failure, so I can see not making autovacuum more aggressive than
usual, and perhaps even disabling it until there is some sort of
confirmation (I have no idea how) that a backup has been made. That
said, a database VACUUM would be one of my first steps after
ensuring that I had a copy of the data directory tree, personally.
I guess I could even live with that as recommended procedure rather
than something triggered through autovacuum and not feel that the
rest of my posts on this are too far off track.

> The main way I expect to validate this sort of thing is with an as
> yet unwritten function to grab information about a data block from
> a standby server for this purpose, something like this:
>
> Master: Computed CRC A, Stored CRC B; error raised because A!=B
> Standby: Computed CRC C, Stored CRC D
>
> If C==D && A==C, the corruption is probably overwritten bits of
> the CRC B.

Are you arguing we need *that* infrastructure to get the feature
launched?
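
For concreteness, the one case Greg spells out above can be written as a
small predicate. This is only a sketch: the function names are
hypothetical, zlib.crc32 stands in for whatever checksum the feature
would actually use (the thread does not say), and any combination other
than the one Greg describes is deliberately left undiagnosed.

```python
import zlib


def compute_crc(page: bytes) -> int:
    # Assumption: zlib.crc32 is a stand-in; the thread does not name the
    # checksum algorithm a real implementation would use.
    return zlib.crc32(page) & 0xFFFFFFFF


def stored_crc_probably_damaged(a: int, b: int, c: int, d: int) -> bool:
    """Return True only for the case described in the quoted message.

    a: CRC computed on the master     b: CRC stored on the master
    c: CRC computed on the standby    d: CRC stored on the standby
    """
    master_raised_error = a != b   # the master raised an error because A != B
    standby_consistent = c == d    # the standby's page agrees with its own CRC
    data_matches = a == c          # both servers compute the same CRC
    return master_raised_error and standby_consistent and data_matches


# Hypothetical example: identical page contents on both servers, but the
# CRC stored on the master has a flipped bit.
page = b"\x00" * 8192
a = compute_crc(page)
b = a ^ 0x1
c = compute_crc(page)
d = c
print(stored_crc_probably_damaged(a, b, c, d))
```

A False result does not mean the page is healthy; it only means this
particular diagnosis does not apply.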

-Kevin
