Re: Race-condition with failed block-write?

From: Arjen van der Meijden <acm(at)tweakers(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Race-condition with failed block-write?
Date: 2005-09-13 18:40:25
Message-ID: 43271D19.7030701@tweakers.net
Lists: pgsql-bugs

On 13-9-2005 20:04, Tom Lane wrote:
> Arjen van der Meijden <acm(at)tweakers(dot)net> writes:
>
>>On 13-9-2005 16:25, Tom Lane wrote:
>>
>>Well, its an index, not a table. It was the index:
>>"pg_class_relname_nsp_index" on pg_class(relname, relnamespace).
>
> Ah. So you've reindexed pg_class at some point. Reindexing it again
> would likely get you out of this.

Unless reindexing is part of other commands, I didn't do that. The last
time 'grep' was able to find a reference to something being reindexed
was in June; something (maybe me, but I doubt it, since I'd have
reindexed the user tables too) was reindexing all system tables back then.
Besides, it's not just the index on pg_class; pg_class itself (and
pg_index) have wrong LSNs as well.

>>Using pg_filedump I extracted the LSN for block 21 and indeed, that was
>>already 67713428 instead of something below 2E73E53C. It wasn't that
>>block alone though, here are a few LSN-lines from it:
>
>
>> LSN: logid 41 recoff 0x676f5174 Special 8176 (0x1ff0)
>> LSN: logid 25 recoff 0x3c6c5504 Special 8176 (0x1ff0)
>> LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
>> LSN: logid 41 recoff 0x2ea88190 Special 8176 (0x1ff0)
>> LSN: logid 1 recoff 0x68e2f660 Special 8176 (0x1ff0)
>> LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
>> LSN: logid 1 recoff 0x68e2f6a4 Special 8176 (0x1ff0)
>
>
> logid is the high-order half of the LSN, so there's nothing wrong with
> those other pages --- it's only the first one you show there that seems
> to be past the current end of WAL.

In that index file, 3 of the 40 blocks had an LSN like the first one
above, so high-order 41 and recoff 0x67[67]something.
In the pg_class file there were 6 blocks, 5 of which had LSNs like
that one. And pg_index has 3 blocks, 1 of them wrong.
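
To illustrate the point about logid being the high-order half: a minimal
sketch of how these LSNs order, assuming each one is simply a pair of
32-bit values (logid, recoff) as pg_filedump prints them. The function
name lsn_value is made up for this example.

```python
# Page LSNs as (logid, recoff) pairs, copied from the pg_filedump lines above.
page_lsns = [
    (41, 0x676F5174),
    (25, 0x3C6C5504),
    (41, 0x2EA8A270),
    (41, 0x2EA88190),
    (1, 0x68E2F660),
    (41, 0x2EA8A270),
    (1, 0x68E2F6A4),
]


def lsn_value(logid, recoff):
    """Combine the two 32-bit halves into one comparable integer (hypothetical helper)."""
    return (logid << 32) | recoff


# Correct ordering: logid decides first, recoff only breaks ties.
highest = max(page_lsns, key=lambda lsn: lsn_value(*lsn))
print("highest LSN:          %d/%08X" % highest)

# Looking at recoff alone picks a page with logid 1, which is actually far older.
by_recoff_only = max(page_lsns, key=lambda lsn: lsn[1])
print("highest recoff alone: %d/%08X" % by_recoff_only)
```

So the pages with logid 1 or 25 are old even though their recoff looks
large, and only the logid 41 page with recoff 0x67... stands out.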

>>On that day I did some active query-tuning, but a few times it took too
>>long, so I issued immediate shut downs when the selects took too long.
>>There were no warnings about broken records afterwards in the log
>>though, so I don't believe anything got damaged afterwards.
>
> I have a feeling something may have gone wrong here, though it's hard to
> say what. If the bogus pages in the other tables all have LSNs close to
> this one then that makes it less likely that this is a random corruption
> event --- what would be more plausible is that end of WAL really was
> that high and somehow the WAL counter got reset back during one of those
> forced restarts.
>
> Can you show us ls -l output for the pg_xlog directory? I'm interested
> to see the file names and mod dates there.

Here you go:

l /var/lib/postgresql/data/pg_xlog/
total 145M
drwx------ 3 postgres postgres 4.0K Sep 1 12:37 .
drwx------ 8 postgres postgres 4.0K Sep 13 20:31 ..
-rw------- 1 postgres postgres 16M Sep 13 19:25 00000001000000290000002E
-rw------- 1 postgres postgres 16M Sep 1 12:36 000000010000002900000067
-rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000068
-rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000069
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006A
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006B
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006C
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006D
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006E
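
For reading the file names against the page LSNs, here is a rough sketch
that maps an LSN to the segment file it would fall in. It assumes the
name is timeline, logid and segment number as three 8-digit hex fields,
16 MB segments (as the 16M sizes suggest) and timeline 1 (the leading
00000001). The helper name wal_segment_name is invented for this example.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: matches the 16M files in the listing
TIMELINE = 1                     # assumption: taken from the leading 00000001


def wal_segment_name(logid, recoff, timeline=TIMELINE):
    """Return the segment file name an LSN would fall in (hypothetical helper)."""
    segno = recoff // SEGMENT_SIZE  # which 16 MB slice of this logid
    return "%08X%08X%08X" % (timeline, logid, segno)


# The LSN that looks too high, and one of the ordinary logid 41 LSNs.
suspect = wal_segment_name(41, 0x676F5174)
normal = wal_segment_name(41, 0x2EA8A270)

print("suspect page LSN -> " + suspect)
print("normal page LSN  -> " + normal)
```

Under those assumptions the ordinary logid 41 LSNs land in the ...2E
file (the one modified today), while the suspect LSN lands in the ...67
file, last modified Sep 1 12:36.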

During the data load it was warning about too-frequent checkpoints, but I
do hope that's mostly a performance issue, not a stability one?

Best regards,

Arjen van der Meijden
