Re: Race-condition with failed block-write?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Arjen van der Meijden <acm(at)tweakers(dot)net>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Race-condition with failed block-write?
Date: 2005-09-13 18:04:06
Message-ID: 25482.1126634646@sss.pgh.pa.us
Lists: pgsql-bugs

Arjen van der Meijden <acm(at)tweakers(dot)net> writes:
> On 13-9-2005 16:25, Tom Lane wrote:
>> The first thing you ought to find out is which table
>> 1663/2013826/9975789 is, and look to see if the corrupted LSN value is
>> already present on disk in that block.  

> Well, it's an index, not a table. It was the index:
> "pg_class_relname_nsp_index" on pg_class(relname, relnamespace).

Ah.  So you've reindexed pg_class at some point.  Reindexing it again
would likely get you out of this.

> Using pg_filedump I extracted the LSN for block 21 and indeed, that was 
> already 67713428 instead of something below 2E73E53C. It wasn't that 
> block alone though, here are a few LSN-lines from it:

>   LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
>   LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
>   LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
>   LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
>   LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
>   LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
>   LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)

logid is the high-order half of the LSN, so there's nothing wrong with
those other pages --- it's only the first one you show there that seems
to be past the current end of WAL.
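
To make the comparison concrete, here is a minimal Python sketch. It assumes the pg_filedump lines look exactly like the ones quoted above and that logid is printed in decimal; parse_lsn, lsn_key and format_lsn are names made up for this example, not part of pg_filedump or PostgreSQL. LSNs are ordered by logid first and recoff second, so a page with a larger recoff but a smaller logid is still older.

```python
import re

# Matches the "logid <decimal> recoff 0x<hex>" part of a pg_filedump LSN line
# (format assumed from the lines quoted in this thread).
LSN_LINE = re.compile(r"logid\s+(\d+)\s+recoff\s+0x([0-9a-fA-F]+)")


def parse_lsn(line):
    """Return (logid, recoff) as integers from one pg_filedump LSN line."""
    match = LSN_LINE.search(line)
    if match is None:
        raise ValueError("no LSN found in line: %r" % line)
    return int(match.group(1)), int(match.group(2), 16)


def lsn_key(lsn):
    """Sort key: logid is the high-order half, recoff the low-order half."""
    logid, recoff = lsn
    return (logid << 32) | recoff


def format_lsn(lsn):
    """Render an LSN in the hex 'logid/recoff' form used in server messages."""
    logid, recoff = lsn
    return "%X/%X" % (logid, recoff)


# The seven lines quoted above.
quoted_lines = [
    "  LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)",
    "  LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)",
    "  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)",
    "  LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)",
    "  LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)",
    "  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)",
    "  LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)",
]

lsns = [parse_lsn(line) for line in quoted_lines]
newest = max(lsns, key=lsn_key)  # the page with the highest LSN
```

Comparing recoff alone would be misleading: the logid 1 pages have a larger recoff than most of the logid 41 pages, yet they are far older.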

> On that day I did some active query-tuning, but a few times it took too 
> long, so I issued immediate shutdowns when the selects took too long. 
> There were no warnings about broken records afterwards in the log 
> though, so I don't believe anything got damaged afterwards.

I have a feeling something may have gone wrong here, though it's hard to
say what.  If the bogus pages in the other tables all have LSNs close to
this one then that makes it less likely that this is a random corruption
event --- what would be more plausible is that end of WAL really was
that high and somehow the WAL counter got reset back during one of those
forced restarts.

Can you show us ls -l output for the pg_xlog directory?  I'm interested
to see the file names and mod dates there.
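
For reading the file names in that listing, a small Python sketch may help. It assumes the 24-hex-digit segment naming of PostgreSQL 8.0 and later (timeline, logid, segment number, eight hex digits each) and the default 16 MB segment size. The function name and the sample file name below are hypothetical, chosen only for illustration; they are not taken from the reporter's system.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default WAL segment size (16 MB)


def decode_wal_segment_name(name):
    """Split a pg_xlog segment file name into its parts.

    Returns (timeline, logid, segment, start_recoff), where start_recoff
    is the recoff at which the segment begins within its logid.
    """
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    timeline = int(name[0:8], 16)
    logid = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, logid, segment, segment * SEGMENT_SIZE


# Hypothetical file name, for illustration only.
timeline, logid, segment, start = decode_wal_segment_name(
    "000000010000002900000067"
)
```

Under these assumptions, the highest-numbered file in the directory shows roughly where the end of WAL is, which can then be compared against the page LSNs.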

			regards, tom lane
