Re: Invalid headers and xlog flush failures

From: Bricklen Anderson <BAnderson(at)PresiNET(dot)com>
To: pgsql-general(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Invalid headers and xlog flush failures
Date: 2005-02-03 18:21:54
Message-ID: 42026BC2.6030601@PresiNET.com
Lists: pgsql-general

Bricklen Anderson wrote:
> Tom Lane wrote:
>
>> Bricklen Anderson <BAnderson(at)PresiNET(dot)com> writes:
>>
>>> Tom Lane wrote:
>>>
>>>> I would have suggested that maybe this represented on-disk data
>>>> corruption, but the appearance of two different but not-too-far-apart
>>>> WAL offsets in two different pages suggests that indeed the end of WAL
>>>> was up around segment 972 or 973 at one time.
>>
>>
>>
>>> Nope, never touched pg_resetxlog.
>>> My pg_xlog list ranges from 000000010000007300000041 to
>>> 0000000100000073000000FE, with no breaks. There are also these:
>>> 000000010000007400000000 to 00000001000000740000000B
>>
>>
>>
>> That seems like rather a lot of files; do you have checkpoint_segments
>> set to a large value, like 100? The pg_controldata dump shows that the
>> latest checkpoint record is in the 73/41 file, so presumably the active
>> end of WAL isn't exceedingly far past that. You've got 200 segments
>> prepared for future activity, which is a bit over the top IMHO.
>>
>> But anyway, the evidence seems pretty clear that in fact end of WAL is
>> in the 73 range, and so those page LSNs with 972 and 973 have to be
>> bogus. I'm back to thinking about dropped bits in RAM or on disk.
>> IIRC these numbers are all hex, so the extra "9" could come from just
>> two bits getting turned on that should not be. Might be time to run
>> memtest86 and/or badblocks.
>>
>> regards, tom lane
>
>
> Yes, checkpoint_segments is set to 100, although I can set that lower if
> you feel that that is more appropriate. Currently, the system receives
> around 5-8 million inserts per day (across 3 primary tables), so I was
> leaning towards the "more is better" philosophy.
>
> We ran e2fsck with badblocks option last week and didn't turn anything
> up, along with a couple of passes with memtest. I will run a full-scale
> memtest and post any interesting results.
>
> I've also read that kill -9 postmaster is "not a good thing". I honestly
> can't say whether or not this occurred around
> the time of the initial creation of this database. It's possible, since
> this db started its life as a development db at 8r3 then was bumped to
> 8r5, then on to 8 final where it has become a dev-final db.
>
> Assuming that the memtest passes cleanly, as does another run of
> badblocks, do you have any more suggestions on how I should proceed?
> Should I run for a while with zero_damaged_pages set to true and accept
> the data loss, or just recreate the whole db from scratch?
>

memtest86+ ran for over 15 hours with no errors reported.
e2fsck -c completed with no errors reported.

Any ideas on what I should try next? Considering that this db is not in production yet, I _do_ have
the liberty to rebuild the database if necessary. Do you have any further recommendations?

thanks again,

Bricklen
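
For anyone following the arithmetic in the quoted discussion, here is a rough sketch of the two numeric points: how many segment files the pg_xlog listing covers, and how few bits separate a log id in the 73 range from the suspicious 973. It assumes the WAL file naming of that era (24 hex digits: timeline, log id, segment, with segments 00 through FE per log id), and it assumes 0x73 is the value the bad page LSNs should have carried. The helper names are made up for illustration and are not part of any PostgreSQL tool.

```python
# Sketch only. Assumption: 24-hex-digit WAL file names made of
# timeline (8 digits), log id (8 digits), segment (8 digits), with
# segments 00..FE in each log id, as the listing in this thread suggests.
SEGMENTS_PER_LOG = 0xFF  # 255 segments per log id (assumed for this era)


def parse_wal_name(name):
    """Split a WAL file name into (timeline, log id, segment) integers."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def segments_between(first, last):
    """Count segment files from first to last, inclusive, on one timeline."""
    tli_a, log_a, seg_a = parse_wal_name(first)
    tli_b, log_b, seg_b = parse_wal_name(last)
    if tli_a != tli_b:
        raise ValueError("timelines differ")
    start = log_a * SEGMENTS_PER_LOG + seg_a
    end = log_b * SEGMENTS_PER_LOG + seg_b
    return end - start + 1


def flipped_bits(expected, observed):
    """Number of bit positions in which two integers differ."""
    return bin(expected ^ observed).count("1")


# File names taken from the pg_xlog listing quoted above.
total = segments_between("000000010000007300000041",
                         "00000001000000740000000B")
print(total)                       # 202 segment files on disk

# Assumption: the page LSNs should have carried log id 0x73, not 0x973.
print(flipped_bits(0x73, 0x973))   # 2 bits differ (0x900 = 1001 0000 0000)
```

The count of 202 lines up with the "200 segments" remark, and with checkpoint_segments = 100 it sits close to the 2 * checkpoint_segments + 1 = 201 files that the 8.0 documentation gives as the usual upper bound. The bit count only shows that the dropped-bits theory is arithmetically plausible; it does not say where the flip happened.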
