
Re: Race-condition with failed block-write?

From: Arjen van der Meijden <acm(at)tweakers(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Race-condition with failed block-write?
Date: 2005-09-13 17:43:06
Message-ID: 43270FAA.20301@tweakers.net
Lists: pgsql-bugs
On 13-9-2005 16:25, Tom Lane wrote:
> Arjen van der Meijden <acm(at)tweakers(dot)net> writes:
> 
> It's highly unlikely that that query has anything to do with it, since
> it's not touching anything but system catalogs and not trying to write
> them either.

Indeed, other things trigger it as well.

> The first thing you ought to find out is which table
> 1663/2013826/9975789 is, and look to see if the corrupted LSN value is
> already present on disk in that block.  

Well, it's an index, not a table. It was the index:
"pg_class_relname_nsp_index" on pg_class(relname, relnamespace).

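For reference, a minimal Python sketch of how an identifier like 1663/2013826/9975789 can be split up and turned into catalog lookups. It assumes the three numbers are tablespace OID, database OID and relfilenode, in that order; the helper names (parse_relfilenode, lookup_queries) are made up for illustration, and the generated queries have to be run in the affected database:

```python
from typing import List, NamedTuple


class RelFileNode(NamedTuple):
    tablespace_oid: int
    database_oid: int
    relfilenode: int


def parse_relfilenode(text: str) -> RelFileNode:
    """Split 'tablespace/database/relfilenode' into its three numbers."""
    parts = text.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three '/'-separated numbers, got {text!r}")
    return RelFileNode(*(int(p) for p in parts))


def lookup_queries(node: RelFileNode) -> List[str]:
    """Build the catalog queries that name the database and the relation."""
    return [
        f"SELECT datname FROM pg_database WHERE oid = {node.database_oid};",
        # relkind is 'r' for a table and 'i' for an index
        f"SELECT relname, relkind FROM pg_class "
        f"WHERE relfilenode = {node.relfilenode};",
    ]


node = parse_relfilenode("1663/2013826/9975789")
for query in lookup_queries(node):
    print(query)
```
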
Using pg_filedump I extracted the LSN for block 21 and indeed, that was 
already 67713428 instead of something below 2E73E53C. It wasn't that 
block alone, though; here are a few LSN lines from it:

  LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
  LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
  LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
  LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)

I tried other files and each one I tried only had LSNs of 0.
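
The LSN lines above can be checked mechanically. Below is a rough Python sketch that parses the "LSN: logid ... recoff ..." lines from pg_filedump output and lists the ones that lie ahead of a given flush position. The flush position used here, logid 41 with recoff 0x2E73E53C, is an assumption: only the recoff value appears in this mail, and the logid is a guess. The function names are made up for the example.

```python
import re
from typing import List, Tuple

Lsn = Tuple[int, int]  # (logid, recoff)

# Matches lines such as: "LSN:  logid     41 recoff 0x676f5174  Special ..."
LSN_LINE = re.compile(r"LSN:\s+logid\s+(\d+)\s+recoff\s+0x([0-9a-fA-F]+)")


def parse_lsns(dump_text: str) -> List[Lsn]:
    """Return every (logid, recoff) pair found in pg_filedump output."""
    return [(int(m.group(1)), int(m.group(2), 16))
            for m in LSN_LINE.finditer(dump_text)]


def ahead_of(lsns: List[Lsn], flushed: Lsn) -> List[Lsn]:
    """Keep the LSNs that lie beyond the flushed position.

    Tuples compare by logid first and then by recoff, which is the
    ordering wanted here.
    """
    return [lsn for lsn in lsns if lsn > flushed]


SAMPLE = """
  LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
  LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
  LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
  LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)
"""

# Assumed flush position; the logid is a guess, see above.
ASSUMED_FLUSHED: Lsn = (41, 0x2E73E53C)

lsns = parse_lsns(SAMPLE)
suspicious = ahead_of(lsns, ASSUMED_FLUSHED)
for logid, recoff in suspicious:
    print(f"{logid:X}/{recoff:08X}")
```

With a different flush position the set of flagged blocks changes, so the real value from the error message should be filled in before drawing conclusions.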

When trying (\d indexname in psql) to determine which table that 
index belonged to, I noticed I got the errors again, but for another 
file (pg_index this time). Another try after that (oid2name ...) hit 
yet another file (the pg_class table). All those files were last 
changed around August 25, so there are no new changes.

On that day I did some active query tuning, and a few times I issued 
immediate shutdowns when the selects took too long. There were no 
warnings about broken records in the log afterwards, though, so I 
don't believe anything got damaged by that.

After that I loaded some fresh data from a production database using 
either pg_restore or psql < some-file-from-pg_dump.sql (I don't know 
which one anymore). A few days later I shut down that postgres, 
installed 8.1-beta and used that (in another directory, of course). 
This 8.0.3 only came back up because of a reboot and hasn't been used 
since that reboot.

I guess those system tables got mixed up during that reload?

> If it is, then we've probably
> not got much chance of finding out how it got there.  If it is *not* on
> disk, but you have a repeatable way of causing this to happen starting
> from a clean postmaster start, then that's pretty interesting --- but
> I don't know any way of figuring it out short of groveling through the
> code with a debugger.  If you're not already pretty familiar with the PG
> code, coaching you remotely isn't going to work very well :-(.  I'd be
> glad to look into it if you can get me access to the machine though.

Well, I can very probably give you that access. But as you say, finding 
out what went wrong is very hard to do.

Best regards,

Arjen van der Meijden
