Re: Race-condition with failed block-write?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Arjen van der Meijden <acm(at)tweakers(dot)net>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Race-condition with failed block-write?
Date: 2005-09-13 18:04:06
Message-ID: 25482.1126634646@sss.pgh.pa.us
Lists: pgsql-bugs

Arjen van der Meijden <acm(at)tweakers(dot)net> writes:
> On 13-9-2005 16:25, Tom Lane wrote:
>> The first thing you ought to find out is which table
>> 1663/2013826/9975789 is, and look to see if the corrupted LSN value is
>> already present on disk in that block.

> Well, it's an index, not a table. It was the index:
> "pg_class_relname_nsp_index" on pg_class(relname, relnamespace).

Ah. So you've reindexed pg_class at some point. Reindexing it again
would likely get you out of this.

> Using pg_filedump I extracted the LSN for block 21 and indeed, that was
> already 67713428 instead of something below 2E73E53C. It wasn't that
> block alone though, here are a few LSN-lines from it:

> LSN: logid 41 recoff 0x676f5174 Special 8176 (0x1ff0)
> LSN: logid 25 recoff 0x3c6c5504 Special 8176 (0x1ff0)
> LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
> LSN: logid 41 recoff 0x2ea88190 Special 8176 (0x1ff0)
> LSN: logid 1 recoff 0x68e2f660 Special 8176 (0x1ff0)
> LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
> LSN: logid 1 recoff 0x68e2f6a4 Special 8176 (0x1ff0)

logid is the high-order half of the LSN, so there's nothing wrong with
those other pages --- it's only the first one you show there that seems
to be past the current end of WAL.
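
The comparison above can be sketched in Python. This is only an illustration added to the message, not a PostgreSQL tool: it assumes the pg_filedump line format shown in the quoted output, and parse_lsn, lsn_to_int and pages_past are hypothetical helper names. The end-of-WAL value is a parameter that would have to come from the server itself.

```python
import re

# Matches lines such as "LSN: logid 41 recoff 0x676f5174 Special 8176 (0x1ff0)".
# The format is assumed from the pg_filedump output quoted in this thread.
LSN_LINE = re.compile(r"logid\s+(\d+)\s+recoff\s+0x([0-9a-fA-F]+)")


def parse_lsn(line):
    """Return (logid, recoff) as integers from one pg_filedump LSN line."""
    match = LSN_LINE.search(line)
    if match is None:
        raise ValueError("no LSN found in line: %r" % line)
    return int(match.group(1)), int(match.group(2), 16)


def lsn_to_int(lsn):
    """Combine the halves into one number: logid is the high-order 32 bits."""
    logid, recoff = lsn
    return (logid << 32) | recoff


def pages_past(lines, end_of_wal):
    """Return the lines whose LSN is beyond end_of_wal, a (logid, recoff) pair.

    Tuples compare on logid first and on recoff only when the logids are
    equal, which is the same ordering as comparing lsn_to_int() values.
    """
    return [line for line in lines if parse_lsn(line) > end_of_wal]


if __name__ == "__main__":
    older = parse_lsn("LSN: logid 25 recoff 0x3c6c5504 Special 8176 (0x1ff0)")
    newer = parse_lsn("LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)")
    # The logid 25 page is older even though its recoff is numerically larger.
    print(older < newer, lsn_to_int(older) < lsn_to_int(newer))
```

Because logid is compared first, a page with a small logid is never past the end of WAL just because its recoff looks large.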

> On that day I did some active query-tuning, and a few times, when the
> selects took too long, I issued immediate shutdowns.
> There were no warnings about broken records in the log afterwards,
> though, so I don't believe anything got damaged.

I have a feeling something may have gone wrong here, though it's hard to
say what. If the bogus pages in the other tables all have LSNs close to
this one then that makes it less likely that this is a random corruption
event --- what would be more plausible is that end of WAL really was
that high and somehow the WAL counter got reset back during one of those
forced restarts.

Can you show us ls -l output for the pg_xlog directory? I'm interested
to see the file names and mod dates there.

regards, tom lane
