Re: Race-condition with failed block-write?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Arjen van der Meijden <acm(at)tweakers(dot)net>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Race-condition with failed block-write?
Date: 2005-09-13 18:53:11
Message-ID: 25897.1126637591@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Arjen van der Meijden <acm(at)tweakers(dot)net> writes:
> Unless reindexing is part of other commands, I didn't do that. The last
> time 'grep' was able to find an reference to something being reindexed
> was in June, something (maybe me, but I doubt it, I'd also reindex the
> user-tables, I suppose) was reindexing all system tables back then.

Well, that would have done it. But if you have such complete logs, do
you have the postmaster log from the whole period? Look for sequences
approximately like this:

2005-09-13 14:48:33.450 EDT 25857 LOG: database system was interrupted at 2005-09-13 14:42:08 EDT
2005-09-13 14:48:33.450 EDT 25857 LOG: checkpoint record is at 0/12B65A40
2005-09-13 14:48:33.450 EDT 25857 LOG: redo record is at 0/12B65A40; undo record is at 0/0; shutdown FALSE
2005-09-13 14:48:33.450 EDT 25857 LOG: next transaction ID: 67789; next OID: 240494
2005-09-13 14:48:33.450 EDT 25857 LOG: next MultiXactId: 1; next MultiXactOffset: 0
2005-09-13 14:48:33.450 EDT 25857 LOG: database system was not properly shut down; automatic recovery in progress
2005-09-13 14:48:33.452 EDT 25857 LOG: record with zero length at 0/12B65A88
2005-09-13 14:48:33.452 EDT 25857 LOG: redo is not required
2005-09-13 14:48:33.474 EDT 25857 LOG: database system is ready

and note whether there's any one that shows a WAL end address (here
0/12B65A88) that's less than the one before.

> l /var/lib/postgresql/data/pg_xlog/
> total 145M
> drwx------ 3 postgres postgres 4.0K Sep 1 12:37 .
> drwx------ 8 postgres postgres 4.0K Sep 13 20:31 ..
> -rw------- 1 postgres postgres 16M Sep 13 19:25 00000001000000290000002E
> -rw------- 1 postgres postgres 16M Sep 1 12:36 000000010000002900000067
> -rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000068
> -rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000069
> -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006A
> -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006B
> -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006C
> -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006D
> -rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006E

This is definitely ugly: there's no way all those higher-numbered
segments would be in place if the WAL counter hadn't been up to that
range. So I think we are definitely looking at a WAL-counter-went-
backward problem, not some random data corruption.

Judging from the file dates, the WAL counter was still OK (ie, above
67xxx) on Sep 1, and the problem must have occurred since then.

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Arjen van der Meijden 2005-09-13 20:12:18 Re: Race-condition with failed block-write?
Previous Message Arjen van der Meijden 2005-09-13 18:40:25 Re: Race-condition with failed block-write?