Re: corrupt pages detected by enabling checksums

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: corrupt pages detected by enabling checksums
Date: 2013-04-03 23:52:41
Message-ID: 20130403235241.GA19178@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-04-03 15:57:49 -0700, Jeff Janes wrote:
> I've changed the subject from "regression test failed when enabling
> checksum" because I now know they are totally unrelated.
>
> My test case didn't need to depend on archiving being on, and so with a
> simple tweak I rendered the two issues orthogonal.
>
>
> On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> > On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:
> >
> > > What I would probably really want is the data as it existed after the
> > > crash but before recovery started, but since the postmaster
> > > immediately starts recovery after the crash, I don't know of a good
> > > way to capture this.
> >
> > Can you just turn off the restart_after_crash GUC? I had a chance to
> > look at this, and seeing the block before and after recovery would be
> > nice. I didn't see a log file in the data directory, but it didn't go
> > through recovery, so I assume it already did that.
> >
>
> You don't know that the cluster is in the bad state until after it goes
> through recovery because most crashes recover perfectly fine. So it would
> have to make a side-copy of the cluster after the crash, then recover the
> original and see how things go, then either retain or delete the side-copy.
> Unfortunately my testing harness can't do this at the moment, because the
> perl script storing the consistency info needs to survive over the crash
> and recovery. It might take me a while to figure out how to make it do
> this.
>
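For that side-copy step, something along these lines might work (a rough,
untested sketch; paths and the consistency-check hook are placeholders, and it
relies on restart_after_crash = off, as suggested above, so the postmaster
doesn't start recovery on its own):

import shutil
import subprocess

PGDATA = "/tmp/pgdata"           # placeholder paths
SIDE_COPY = "/tmp/pgdata.crash"  # snapshot taken before recovery runs

# Assumes restart_after_crash = off in postgresql.conf, so the postmaster
# does not launch crash recovery by itself after the crash.

def run_consistency_checks():
    # Placeholder for whatever the harness does with the consistency info
    # recorded by the perl script before the crash.
    return True

def snapshot_before_recovery():
    # Copy the whole cluster (including pg_xlog) while it is still in its
    # post-crash, pre-recovery state.
    shutil.rmtree(SIDE_COPY, ignore_errors=True)
    shutil.copytree(PGDATA, SIDE_COPY)

def recover_and_check():
    # Start the original cluster (this runs crash recovery), then decide
    # whether the pre-recovery snapshot is worth keeping.
    subprocess.check_call(["pg_ctl", "-D", PGDATA, "-w", "start"])
    if run_consistency_checks():
        shutil.rmtree(SIDE_COPY)  # recovery went fine, drop the snapshot
        return True
    return False                  # keep SIDE_COPY for post-mortem analysis
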
> I have the server log just go to stderr, where it gets mingled together
> with messages from my testing harness. I had not uploaded that file.
>
> Here is a new upload. It contains both a data_dir tarball including xlog,
> and the mingled stderr ("do_new.out")
>
> https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHc&usp=sharing
>
> The other 3 files in it constitute the harness. It is quite a mess, with
> some hard-coded paths. The just-posted fix for wal_keep_segments will also
> have to be applied.
>
>
>
> >
> > The block is corrupt as far as I can tell. The first third is written,
> > and the remainder is all zeros. The header looks like this:
> >
>
> Yes, that part is by my design. Why it didn't get fixed from a FPI is not
> by my design, of course.

There are no FPIs (if by that you mean full page images) afaics:

Your logs tell us about recovery:
27692 2013-04-03 10:09:15.621 PDT:LOG: redo starts at 1/31000090
27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68
27692 2013-04-03 10:09:15.647 PDT:LOG: redo done at 1/31169A38

And then the error:

27698 SELECT 2013-04-03 10:09:16.688 PDT:WARNING: page verification failed, calculated checksum 16296 but expected 49517
27698 SELECT 2013-04-03 10:09:16.688 PDT:ERROR: invalid page in block 90 of relation base/16384/835589
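
For context, roughly what that verification does (a much simplified stand-in,
not the actual PostgreSQL checksum algorithm; CRC-32 below only takes its
place, and the header offsets are assumptions):

import zlib

BLCKSZ = 8192

def verify_page(page, blkno):
    # Simplified stand-in for data page checksum verification.  Assumes the
    # 16-bit checksum field sits right after the 8-byte page LSN in the
    # page header.
    assert len(page) == BLCKSZ
    stored = int.from_bytes(page[8:10], "little")
    # Recompute over the page with the checksum field treated as zero, and
    # mix in the block number so a page written to the wrong block fails too.
    masked = page[:8] + b"\x00\x00" + page[10:]
    calculated = (zlib.crc32(masked) ^ blkno) & 0xFFFF
    return calculated == stored

With the tail of the page replaced by zeros from the torn write, the
recomputed value has essentially no chance of matching the stored one, hence
the 16296 vs. 49517 mismatch.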

looking at xlog from that time, the first records are:
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 1/31000028, prev 1/30000090, bkp: 0000, desc: checkpoint: redo 1/31000028; tli 1; prev tli 1; fpw true; xid 0/26254997; oid 835589; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; online
rmgr: XLOG len (rec/tot): 4/ 36, tx: 0, lsn: 1/31000090, prev 1/31000028, bkp: 0000, desc: nextOid: 843781
rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 1/310000B8, prev 1/31000090, bkp: 0000, desc: file create: base/16384/835589
rmgr: Heap len (rec/tot): 37/ 7905, tx: 26254997, lsn: 1/310000E8, prev 1/310000B8, bkp: 1000, desc: hot_update: rel 1663/16384/12660; tid 0/47 xmax 26254997 ; new tid 0/33 xmax 0

So the file was created after the last checkpoint, which means no full pages
need to be logged. And we theoretically should be able to recover from things
like torn pages using the WAL alone, since it should cover everything that
happened to that file.
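
In other words, whether a record gets a backup block attached boils down to
comparing the page's LSN with the redo pointer of the current checkpoint; a
simplified sketch of that rule (not the actual xlog.c code):

def needs_full_page_image(page_lsn, redo_ptr, full_page_writes=True):
    # A backup block (full page image) is attached only if the page has not
    # been WAL-logged since the current checkpoint started, i.e. its LSN is
    # not newer than the checkpoint's redo pointer.  Pages of a relation
    # created after that checkpoint get LSNs newer than the redo pointer as
    # soon as they are first WAL-logged, so they never qualify in this cycle.
    return full_page_writes and page_lsn <= redo_ptr

And that matches this case: base/16384/835589 was created at 1/310000B8,
after the checkpoint's redo pointer 1/31000028, so none of its pages should
need an FPI.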

The bkp block 1 indicated in the 4th record above is the only one in
that period of WAL btw.

Looking a bit more, the last page of that relation that's covered by WAL is:
rmgr: Heap len (rec/tot): 35/ 67, tx: 26254998, lsn: 1/311699A8, prev 1/31169960, bkp: 0000, desc: insert: rel 1663/16384/835589; tid 44/56

which is far earlier than the "block 90" the error is found in, unless I
misunderstand something. Which is strange, since non-zero content for
those pages shouldn't go to disk until the respective WAL has been
flushed. Huh, are we missing something here?
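
To spell out the rule I mean (a sketch of the invariant, not the actual
bufmgr.c code): before a dirty buffer is written out, the WAL up to that
page's LSN has to be flushed first.

def flush_dirty_buffer(buf, xlog_flush, smgr_write):
    # WAL-before-data: the buffer manager first forces WAL to disk up to the
    # page's LSN (FlushBuffer() does an XLogFlush() with the page LSN before
    # writing the block), and only then writes the data page itself.  So a
    # page containing changes whose WAL never made it to disk should not be
    # able to reach the data file.
    xlog_flush(buf.page_lsn)    # make the covering WAL durable first
    smgr_write(buf.page_bytes)  # then write out the data page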

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
