Re: regression test failed when enabling checksum

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: regression test failed when enabling checksum
Date: 2013-04-03 09:31:21
Message-ID: 20130403093121.GB4682@awork2.anarazel.de
Lists: pgsql-hackers
On 2013-04-01 19:51:19 -0700, Jeff Janes wrote:
> On Mon, Apr 1, 2013 at 10:37 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> 
> > On Tue, Mar 26, 2013 at 4:23 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> >
> >>
> >> Patch attached. Only brief testing done, so I might have missed
> >> something. I will look more closely later.
> >>
> >
> > After applying your patch, I could run the stress test described here:
> >
> > http://archives.postgresql.org/pgsql-hackers/2012-02/msg01227.php
> >
> > But altered to make use of initdb -k, of course.
> >
> > Over 10,000 cycles of crash and recovery, I encountered two cases of
> > checksum failures after recovery, example:
> > ...
> >
> 
> 
> > Unfortunately I already cleaned up the data directory before noticing the
> > problem, so I have nothing to post for forensic analysis.  I'll try to
> > reproduce the problem.
> >
> >
> I've reproduced the problem, this time in block 74 of relation
> base/16384/4931589, and a tarball of the data directory is here:
> 
> https://docs.google.com/file/d/0Bzqrh1SO9FcELS1majlFcTZsR0k/edit?usp=sharing
> 
> (the table is in database jjanes under role jjanes, the binary is commit
> 9ad27c215362df436f8c)
> 
> What I would probably really want is the data as it existed after the crash
> but before recovery started, but since the postmaster immediately starts
> recovery after the crash, I don't know of a good way to capture this.
> 
> I guess one thing to do would be to extract from the WAL the most recent
> FPW for block 74 of relation base/16384/4931589  (assuming it hasn't been
> recycled already) and see if it matches what is actually in that block of
> that data file, but I don't currently know how to do that.
> 
> 11500 SELECT 2013-04-01 12:01:56.926 PDT:WARNING:  page verification
> failed, calculated checksum 54570 but expected 34212
> 11500 SELECT 2013-04-01 12:01:56.926 PDT:ERROR:  invalid page in block 74
> of relation base/16384/4931589
> 11500 SELECT 2013-04-01 12:01:56.926 PDT:STATEMENT:  select sum(count) from
> foo

I just checked and unfortunately your dump doesn't contain all that much
valid WAL:
rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 7/AB000028, prev 7/AA000090, bkp: 0000, desc: checkpoint: redo 7/AB000028; tli 1; prev tli 1; fpw true; xid 0/156747297; oid 4939781; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; online
rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 7/AB000090, prev 7/AB000028, bkp: 0000, desc: checkpoint: redo 7/AB000090; tli 1; prev tli 1; fpw true; xid 0/156747297; oid 4939781; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
pg_xlogdump: FATAL:  error in WAL record at 7/AB000090: record with zero length at 7/AB0000F8

So just two checkpoint records.
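For longer dumps, the same check can be scripted rather than read by eye. The following is a minimal sketch, not part of pg_xlogdump itself: it assumes the line layout shown above (rmgr, len, tx, lsn, prev, bkp, desc), and the names RECORD_RE and summarize_xlogdump are hypothetical. It counts records per resource manager and record kind, and counts how many records have any backup block flagged in the bkp field, on the assumption that a '1' in that field marks an attached full-page image.

```python
import re
from collections import Counter

# Assumed layout of one pg_xlogdump record line, taken from the output above.
RECORD_RE = re.compile(
    r"^rmgr:\s+(?P<rmgr>\S+)\s+len \(rec/tot\):\s*(?P<rec>\d+)/\s*(?P<tot>\d+), "
    r"tx:\s*(?P<tx>\d+), lsn: (?P<lsn>[0-9A-F]+/[0-9A-F]+), "
    r"prev (?P<prev>[0-9A-F]+/[0-9A-F]+), bkp: (?P<bkp>[01]+), desc: (?P<desc>.*)$"
)


def summarize_xlogdump(lines):
    """Return (Counter of (rmgr, record kind), number of records with backup blocks).

    Lines that do not look like WAL records (for example the trailing
    FATAL message) are skipped.
    """
    kinds = Counter()
    with_backup_blocks = 0
    for line in lines:
        m = RECORD_RE.match(line.strip())
        if m is None:
            continue
        # The record kind is the text before the first colon of the description.
        kind = m.group("desc").split(":", 1)[0].strip()
        kinds[(m.group("rmgr"), kind)] += 1
        # Assumption: a '1' anywhere in the bkp field means a backup block is attached.
        if "1" in m.group("bkp"):
            with_backup_blocks += 1
    return kinds, with_backup_blocks
```

Fed the three output lines above, this would report two XLOG checkpoint records and no records carrying backup blocks, which is why the dump cannot show the last full-page image of the damaged block.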

Unfortunately I fear that won't be enough to diagnose the problem.
Could you reproduce it with a higher wal_keep_segments?

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

