Re: Checksums, state of play

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checksums, state of play
Date: 2012-03-06 17:50:24
Message-ID: 20120306175024.GA1347@momjian.us
Lists: pgsql-hackers

On Tue, Mar 06, 2012 at 09:25:17AM -0500, Robert Haas wrote:
> > 2. Turning checksums on/off/on/off in rapid succession can cause false
> > positive reports of checksum failure if crashes occur and are ignored.
> > That may lead to the feature and PostgreSQL being held in disrepute.
>
> This I do think is a problem, although not for precisely the reason
> stated here. In my experience, in data corruption situations, the
> first thing customers do is blame PostgreSQL: they don't believe it's
> the hardware; they accuse us of having bugs in our code. Having a
> checksum feature would be valuable, because, first, we'd perhaps
> detect problems sooner and, second, people understand what checksums
> are and that checksum failures really shouldn't happen unless the
> hardware is bad. More generally, one of the purposes of checksums is
> to distinguish hardware failure from other possible causes of data
> corruption problems. If there are code paths where checksum failures
> can happen despite the hardware being good, I think that the patch will
> fail to accomplish its goal of giving us confidence that the hardware
> is bad.

I think the "turning checksums on/off/on/off" is really a killer
problem, and obviously many of the actions needed to make it safe make
the checksum feature itself less useful.

One crazy idea would be to have a checksum _version_ number somewhere on
the page and in pg_controldata. When you turn on checksums, you
increment that value, and all new checksum pages get that checksum
version; if you turn off checksums, we just don't check them anymore,
but they might become incorrect due to a hint bit write and a crash. When
you turn on checksums again, you increment the checksum version again,
and only check pages having the _new_ checksum version.

Yes, this does add additional storage requirements for the checksum, but
I don't see another clean option. If you can spare one byte, that lets
you turn on checksums 255 times; after that, you have to dump/reload to
use the checksum feature.
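
A minimal sketch of the version idea, written in Python purely for
illustration; it is not PostgreSQL code. The names Cluster, Page,
enable_checksums, disable_checksums, write_page and verify_page are
hypothetical, CRC-32 stands in for whatever checksum would really be
used, and reserving version 0 to mean "never checksummed" is an
assumption made here so that a one-byte counter yields 255 usable
versions.

```python
import zlib
from dataclasses import dataclass

MAX_VERSION = 255  # largest one-byte value; 0 is reserved (assumption)


@dataclass
class Cluster:
    # Stand-in for the cluster-wide state that would live in the control file.
    checksums_enabled: bool = False
    checksum_version: int = 0


@dataclass
class Page:
    data: bytes = b""
    checksum: int = 0
    checksum_version: int = 0  # 0 means "never written with checksums on"


def enable_checksums(cluster):
    # Each enable gets a fresh version, so older checksums are never trusted.
    if cluster.checksum_version >= MAX_VERSION:
        raise OverflowError("checksum versions exhausted; dump/reload needed")
    cluster.checksum_version += 1
    cluster.checksums_enabled = True


def disable_checksums(cluster):
    cluster.checksums_enabled = False


def write_page(cluster, page, data):
    page.data = data
    if cluster.checksums_enabled:
        page.checksum = zlib.crc32(data)
        page.checksum_version = cluster.checksum_version
    # With checksums off, the old checksum and version stay and go stale.


def verify_page(cluster, page):
    if not cluster.checksums_enabled:
        return "skipped"
    if page.checksum_version != cluster.checksum_version:
        return "skipped"  # written under an older version, or never checksummed
    return "ok" if zlib.crc32(page.data) == page.checksum else "failed"
```

In this sketch, a page written while checksums were off keeps a stale
checksum under its old version, so after the next enable it is skipped
rather than reported as a failure. Only pages rewritten under the
current version are actually verified.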

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
