Re: corrupt pages detected by enabling checksums

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: corrupt pages detected by enabling checksums
Date: 2013-04-04 21:21:19
Message-ID: CAMkU=1xaJYGO+8Wp_Df+f9Qc-HOFn+WSwepoCSxyUC=9iqzy4Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 4, 2013 at 5:30 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 4 April 2013 02:39, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> > If by now the first backend has proceeded to PageSetLSN() we are writing
> > different data to disk than the one we computed the checksum of
> > before. Boom.
>
> Right, so nothing else we were doing was wrong, that's why we couldn't
> spot a bug. The problem is that we aren't replaying enough WAL because
> the checksum on the WAL record is broken.
>

This brings up a pretty frightening possibility to me, unrelated to data
checksums. If a bit gets twiddled in the WAL file due to a hardware issue
or a "cosmic ray", and then a crash happens, automatic recovery will stop
early at the failed WAL checksum with an innocuous-looking message. The
system will start up but will be invisibly inconsistent, and will proceed
to overwrite the portion of the WAL file which contains the old records
(real data that would have been necessary for reconstruction, once the
corruption is finally noticed) with an end-of-recovery checkpoint record,
and continue to chew up real data from there.

I don't know a solution here, though, other than trusting your hardware.
Changing timelines upon ordinary crash recovery, not just media recovery,
seems excessive but also seems to be exactly what timelines were invented
for, right?

Back to the main topic here, Jeff Davis mentioned earlier "You'd still
think this would cause incorrect results, but...". I didn't realize the
significance of that until now. It does produce incorrect query results;
I was just bailing out before detecting them. Once I specify
ignore_checksum_failure=1, my test harness complains bitterly that the
data is not consistent with what the Perl program knows it is supposed
to be.

> I missed out on doing that with XLOG_HINT records, so the WAL CRC can
> be incorrect because the data is scanned twice; normally that would be
> OK because we have an exclusive lock on the block, but with hints we
> only have share lock. So what we need to do is take a copy of the
> buffer before we do XLogInsert().
>
> Simple patch to do this attached for discussion. (Not tested).

> We might also do this by modifying the WAL record to take the whole
> block and bypass the BkpBlock mechanism entirely. But that's more work
> and doesn't seem like it would be any cleaner. I figure lets solve the
> problem first then discuss which approach is best.
>

I've tested your patch and it seems to do the job. It has survived far
longer than unpatched ever did, with neither checksum warnings nor
complaints of inconsistent data. (I haven't analyzed the code much, just
the results, and leave the discussion of the best approach to others.)

Thanks,

Jeff
