Re: Invalid headers and xlog flush failures

From: Bricklen Anderson <BAnderson(at)PresiNET(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Invalid headers and xlog flush failures
Date: 2005-02-04 15:14:14
Message-ID: 42039146.7020808@PresiNET.com
Lists: pgsql-general

Tom Lane wrote:
> Bricklen Anderson <BAnderson(at)PresiNET(dot)com> writes:
>
>>>Tom Lane wrote:
>>>
>>>>But anyway, the evidence seems pretty clear that in fact end of WAL is
>>>>in the 73 range, and so those page LSNs with 972 and 973 have to be
>>>>bogus. I'm back to thinking about dropped bits in RAM or on disk.
>
>
>>memtest86+ ran for over 15 hours with no errors reported.
>>e2fsck -c completed with no errors reported.
>
>
> Hmm ... that's not proof your hardware is ok, but it at least puts the
> ball back in play.
>
>
>>Any ideas on what I should try next? Considering that this db is not
>>in production yet, I _do_ have the liberty to rebuild the database if
>>necessary. Do you have any further recommendations?
>
>
> If the database isn't too large, I'd suggest saving aside a physical
> copy (eg, cp or tar dump taken with postmaster stopped) for forensic
> purposes, and then rebuilding so you can get on with your own work.
>
> One bit of investigation that might be worth doing is to look at every
> single 8K page in the database files and collect information about the
> LSN fields, which are the first 8 bytes of each page.

Do you mean this line from pg_filedump's results:

LSN: logid 56 recoff 0x3f4be440 Special 8176 (0x1ff0)

If so, I've set up a shell script that looped over all of the files and emitted that line.
It's not particularly elegant, but it worked. Again, that's assuming that it was the correct line.
I'll write a perl script to parse out the LSN values to see if any are greater than 116 (which I
believe is the decimal value of hex 74?).
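
A minimal sketch of that filtering step, written in shell/awk rather than perl. It assumes each collected line looks like "<file>: LSN: logid 56 recoff 0x3f4be440 ..." and that pg_filedump prints the logid in decimal. The function name filter_lsn is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print only the lines whose logid exceeds a threshold.
# Assumption: logid is a decimal number following the word "logid".
# For reference, hex 74 is 116 in decimal.
filter_lsn() {
    # $1 = threshold (decimal); the collected LSN lines arrive on stdin
    awk -v min="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == "logid" && $(i + 1) + 0 > min + 0) { print; break }
    }'
}
```

Once the LSN lines have been collected, it could be run as: filter_lsn 116 < LSN_out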

In case anyone wants the script that I ran to get the LSN:
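#!/bin/sh
# Emit the LSN line of every block in every file of one database directory.

for FILE in /var/postgres/data/base/17235/*; do
    i=0
    # record each file visited
    echo "$FILE" >> test_file
    while true; do
        # -R $i dumps block $i only; grep fails once we run past the last block
        str=`pg_filedump -R $i "$FILE" | grep LSN` || break
        echo "$FILE: $str" >> LSN_out
        i=$((i+1))
    done
done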

> In a non-broken database all of these should be less than or equal to the current ending
> WAL offset (which you can get with pg_controldata if the postmaster is
> stopped). We know there are at least two bad pages, but are there more?
> Is there any pattern to the bad LSN values? Also it would be useful to
> look at each bad page in some detail to see if there's any evidence of
> corruption extending beyond the LSN value.
>
> regards, tom lane

NB. I've recreated the database, and saved off the old directory (all 350 gigs of it) so I can dig
into it further.

Thanks again for your help, Tom.

Cheers,

Bricklen
