corrupt pages detected by enabling checksums

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: corrupt pages detected by enabling checksums
Date: 2013-04-03 22:57:49
Message-ID: CAMkU=1yTvoc5D2MzL2KcWxm_vS-kbN6SY_WVHCJVZKOaQ-MB2g@mail.gmail.com
Lists: pgsql-hackers

I've changed the subject from "regression test failed when enabling
checksum" because I now know the two problems are totally unrelated.

My test case didn't need to depend on archiving being on, and so with a
simple tweak I rendered the two issues orthogonal.

On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:
>
> > What I would probably really want is the data as it existed after the
> > crash but before recovery started, but since the postmaster
> > immediately starts recovery after the crash, I don't know of a good
> > way to capture this.
>
> Can you just turn off the restart_after_crash GUC? I had a chance to
> look at this, and seeing the block before and after recovery would be
> nice. I didn't see a log file in the data directory, but it didn't go
> through recovery, so I assume it already did that.
>

You don't know that the cluster is in the bad state until after it goes
through recovery, because most crashes recover perfectly fine. So the
harness would have to make a side-copy of the cluster after the crash, then
recover the original and see how things go, then either retain or delete
the side-copy. Unfortunately my testing harness can't do this at the moment,
because the perl script storing the consistency info needs to survive over
the crash and recovery. It might take me a while to figure out how to make
it do this.
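
The side-copy workflow described above can be sketched roughly as follows.
This is only an illustration in Python, not the perl harness from the upload.
The names recover_with_side_copy and recover_and_check are hypothetical, and
the sketch assumes the postmaster is already stopped with restart_after_crash
turned off, so that the files do not change during the copy.

```python
import shutil
from pathlib import Path
from typing import Callable


def recover_with_side_copy(
    data_dir: Path,
    side_copy: Path,
    recover_and_check: Callable[[Path], bool],
) -> bool:
    """Snapshot data_dir, recover the original, keep the snapshot only on failure.

    recover_and_check is a hypothetical callback supplied by the harness: it
    starts the server on data_dir, lets crash recovery run, verifies the data,
    and returns True if everything is consistent.
    """
    # Copy the crashed cluster (including xlog) before recovery touches it.
    # side_copy must not exist yet; copytree raises FileExistsError otherwise.
    shutil.copytree(data_dir, side_copy)

    ok = recover_and_check(data_dir)

    if ok:
        # Most crashes recover fine, so the snapshot is usually discarded.
        shutil.rmtree(side_copy)
    # On failure the pre-recovery snapshot is left in place for inspection.
    return ok
```

The check is passed in as a callback so that the copy/retain/delete logic
stays independent of how the harness starts the server and validates its
consistency info.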

I have the server log just go to stderr, where it gets mingled together
with messages from my testing harness. I had not uploaded that file.

Here is a new upload. It contains both a data_dir tarball including xlog,
and the mingled stderr ("do_new.out").

https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHc&usp=sharing

The other 3 files in it constitute the harness. It is quite a mess, with
some hard-coded paths. The just-posted fix for wal_keep_segments will also
have to be applied.

>
> The block is corrupt as far as I can tell. The first third is written,
> and the remainder is all zeros. The header looks like this:
>

Yes, that part is by my design. Why it didn't get fixed from a FPI is not
by my design, of course.
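
For illustration, a page in the state described in the quoted text (a written
prefix followed by nothing but zeros) can be recognized with a crude check
like the one below. It is a heuristic sketch only: BLCKSZ = 8192 assumes the
default build-time page size, the function names and the min_zero_tail
threshold are hypothetical, and a freshly initialized empty page (header
followed by zeros) would match as well.

```python
BLCKSZ = 8192  # assumption: PostgreSQL's default page size


def zero_tail_offset(page: bytes) -> int:
    """Return the offset at which the trailing run of zero bytes begins."""
    return len(page.rstrip(b"\x00"))


def looks_partially_written(page: bytes, min_zero_tail: int = BLCKSZ // 2) -> bool:
    """Heuristic: some data at the front, then a long all-zero tail.

    min_zero_tail is an arbitrary illustrative threshold. An all-zero page
    is not reported, because nothing was written to it at all.
    """
    if len(page) != BLCKSZ:
        raise ValueError(f"expected {BLCKSZ} bytes, got {len(page)}")
    written = zero_tail_offset(page)
    return written > 0 and (BLCKSZ - written) >= min_zero_tail
```

This only describes the shape of the damage. It says nothing about whether a
full-page image should have repaired the block during recovery.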

>
> So, the page may be corrupt without checksums as well, but it just
> happens to be hidden for the same reason. Can you try to reproduce it
> without -k?

No, things run (seemingly) fine without -k.

> And on the checkin right before checksums were added?
> Without checksums, you'll need to use pg_filedump (or similar) to find
> whether an error has happened.
>

I'll work on it, but it will take a while to figure out exactly how to use
pg_filedump to do that, and how to automate that process and incorporate it
into the harness.

In the meantime, I tested the checksum commit itself (96ef3b8ff1c) and the
problem occurs there, so if the problem is not the checksums themselves
(and I agree it probably isn't) it must have been introduced before that
rather than after.

Cheers,

Jeff
