Re: Race-condition with failed block-write?

From: Arjen van der Meijden <acm(at)tweakers(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Race-condition with failed block-write?
Date: 2005-09-13 17:43:06
Message-ID: 43270FAA.20301@tweakers.net
Lists: pgsql-bugs

On 13-9-2005 16:25, Tom Lane wrote:
> Arjen van der Meijden <acm(at)tweakers(dot)net> writes:
>
> It's highly unlikely that that query has anything to do with it, since
> it's not touching anything but system catalogs and not trying to write
> them either.

Indeed, other things trigger it as well.

> The first thing you ought to find out is which table
> 1663/2013826/9975789 is, and look to see if the corrupted LSN value is
> already present on disk in that block.

Well, it's an index, not a table. It was the index:
"pg_class_relname_nsp_index" on pg_class(relname, relnamespace).

Using pg_filedump I extracted the LSN for block 21 and indeed, that was
already 67713428 instead of something below 2E73E53C. It wasn't that
block alone though, here are a few LSN-lines from it:

LSN: logid 41 recoff 0x676f5174 Special 8176 (0x1ff0)
LSN: logid 25 recoff 0x3c6c5504 Special 8176 (0x1ff0)
LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
LSN: logid 41 recoff 0x2ea88190 Special 8176 (0x1ff0)
LSN: logid 1 recoff 0x68e2f660 Special 8176 (0x1ff0)
LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
LSN: logid 1 recoff 0x68e2f6a4 Special 8176 (0x1ff0)

I tried other files and each one I tried only had LSNs of 0.
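
For anyone repeating this check, here is a minimal sketch (not part of pg_filedump) that pulls the LSN values out of such output and lists those lying beyond a given flush point. It assumes the logid is printed in decimal and the recoff in hex, as in the lines above. The function names are hypothetical, and the flush point's log id is not stated in this message, so the value 41 used for it below is an assumption made only for illustration.

```python
import re

# Matches the LSN part of a pg_filedump page-header line, e.g.
# "LSN: logid 41 recoff 0x676f5174 Special 8176 (0x1ff0)".
# Assumption: logid is decimal, recoff is hexadecimal.
LSN_RE = re.compile(r"LSN:\s+logid\s+(\d+)\s+recoff\s+0x([0-9a-fA-F]+)")


def parse_lsns(text):
    """Return a list of (logid, recoff) tuples found in the given text."""
    return [(int(m.group(1)), int(m.group(2), 16)) for m in LSN_RE.finditer(text)]


def lsns_ahead_of(lsns, flushed):
    """Return the LSNs that sort after the (logid, recoff) pair `flushed`.

    Tuples compare by logid first and recoff second, which is the
    ordering assumed here for two-part LSNs.
    """
    return [lsn for lsn in lsns if lsn > flushed]


# The LSN lines quoted in this message.
SAMPLE = """\
LSN: logid 41 recoff 0x676f5174 Special 8176 (0x1ff0)
LSN: logid 25 recoff 0x3c6c5504 Special 8176 (0x1ff0)
LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
LSN: logid 41 recoff 0x2ea88190 Special 8176 (0x1ff0)
LSN: logid 1 recoff 0x68e2f660 Special 8176 (0x1ff0)
LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
LSN: logid 1 recoff 0x68e2f6a4 Special 8176 (0x1ff0)
"""

# Assumption: only the offset 2E73E53C appears in this message; the log id
# 41 is a guess for illustration and should be replaced with the real one.
ASSUMED_FLUSHED = (41, 0x2E73E53C)

if __name__ == "__main__":
    for logid, recoff in lsns_ahead_of(parse_lsns(SAMPLE), ASSUMED_FLUSHED):
        print(f"suspect LSN: logid {logid} recoff 0x{recoff:08x}")
```

Under that assumed flush point, the pages with logid 41 would be flagged, while the pages with logid 25 and logid 1 would not.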

When trying (\d indexname in psql) to determine to which table that
index belonged I noticed it got the errors again, but for another file
(pg_index this time). And another try (oid2name ...) after that, yet
another file (the pg_class-table). All those files were last changed
somewhere around August 25, so no new changes.

On that day I did some active query-tuning, and a few times I issued
immediate shutdowns when the selects took too long. There were no
warnings about broken records in the log afterwards though, so I don't
believe anything got damaged then.

After that I loaded some fresh data from a production-database using
either pg_restore or psql < some-file-from-pg_dump.sql (I don't know
which one anymore). A few days later I shut down that postgres,
installed 8.1-beta and used that (in another directory of course). This
8.0.3 only came back up because of a reboot and hasn't been used since
that reboot.

I guess those system tables got mixed up during that reloading?

> If it is, then we've probably
> not got much chance of finding out how it got there. If it is *not* on
> disk, but you have a repeatable way of causing this to happen starting
> from a clean postmaster start, then that's pretty interesting --- but
> I don't know any way of figuring it out short of groveling through the
> code with a debugger. If you're not already pretty familiar with the PG
> code, coaching you remotely isn't going to work very well :-(. I'd be
> glad to look into it if you can get me access to the machine though.

Well, I can very probably give you that access. But as you say, finding
out what went wrong is very hard to do.

Best regards,

Arjen van der Meijden
