Skip site navigation (1) Skip section navigation (2)

Re: Race-condition with failed block-write?

From: Arjen van der Meijden <acm(at)tweakers(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Race-condition with failed block-write?
Date: 2005-09-13 20:12:18
Message-ID: 432732A2.9060300@tweakers.net (view raw or flat)
Thread:
Lists: pgsql-bugs
On 13-9-2005 20:53, Tom Lane wrote:
> Arjen van der Meijden <acm(at)tweakers(dot)net> writes:
> 
>>Unless reindexing is part of other commands, I didn't do that. The last 
>>time 'grep' was able to find an reference to something being reindexed 
>>was in June, something (maybe me, but I doubt it, I'd also reindex the 
>>user-tables, I suppose) was reindexing all system tables back then.
> 
> 
> Well, that would have done it.  But if you have such complete logs, do
> you have the postmaster log from the whole period?  
I have logs from the time I started using it seriously indeed. It was 
just never rotated or something. I did adjust it to have proper prefixes 
only at 23th of august, so everything before that is a bit less easy to 
read.
Besides that it contains lots of "debug" stuff, like logged queries and 
their statistics.

It is a lot of text and the file is currently 52MB. But without all 
those repeated failure-messages its about 862KB. So I can e-mail a 
compressed version to you, if you're interested in it.

> Look for sequences
> approximately like this:
> 
> and note whether there's any one that shows a WAL end address (here
> 0/12B65A88) that's less than the one before.

I take it you mean something like this:
[ - 2005-08-23 15:29:26 CEST @] LOG:  checkpoint record is at 29/2E73E410
[ - 2005-08-23 15:37:27 CEST @] LOG:  checkpoint record is at 29/2E73E410
[ - 2005-08-25 10:26:27 CEST @] LOG:  checkpoint record is at 29/2E8D26B4
[ - 2005-08-25 10:47:59 CEST @] LOG:  checkpoint record is at 29/2E8D26F0
[ - 2005-09-05 11:24:08 CEST @] LOG:  checkpoint record is at 29/2E73E44C
[ - 2005-09-09 10:27:20 CEST @] LOG:  checkpoint record is at 29/2E73E44C

The last start-up at 25th of August has a higher second value than the 
start-up at the 5th of September. Which was a start since it was closed 
at sep 1.

>>l /var/lib/postgresql/data/pg_xlog/
>>total 145M
>>drwx------  3 postgres postgres 4.0K Sep  1 12:37 .
>>drwx------  8 postgres postgres 4.0K Sep 13 20:31 ..
>>-rw-------  1 postgres postgres  16M Sep 13 19:25 00000001000000290000002E
>>-rw-------  1 postgres postgres  16M Sep  1 12:36 000000010000002900000067
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000068
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000069
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006A
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006B
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006C
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006D
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006E
> 
> 
> This is definitely ugly: there's no way all those higher-numbered
> segments would be in place if the WAL counter hadn't been up to that
> range.  So I think we are definitely looking at a WAL-counter-went-
> backward problem, not some random data corruption.
> 
> Judging from the file dates, the WAL counter was still OK (ie, above
> 67xxx) on Sep 1, and the problem must have occurred since then.

That's odd. But it must have been my hard attempts to shut it down then. 
These are all the lines there are for Sep 1:

[ - 2005-09-01 12:36:50 CEST @] LOG:  received fast shutdown request
[ - 2005-09-01 12:36:50 CEST @] LOG:  shutting down
[ - 2005-09-01 12:36:50 CEST @] LOG:  database system is shut down
[ - 2005-09-01 12:37:01 CEST @] LOG:  received smart shutdown request
[ - 2005-09-01 12:37:01 CEST @] LOG:  shutting down
[ - 2005-09-01 12:37:01 CEST @] LOG:  database system is shut down
[ - 2005-09-05 11:24:08 CEST @] LOG:  database system was shut down at 
2005-09-01 12:37:01 CEST
[ - 2005-09-05 11:24:08 CEST @] LOG:  checkpoint record is at 29/2E73E44C
[ - 2005-09-05 11:24:08 CEST @] LOG:  redo record is at 29/2E73E44C; 
undo record is at 0/0; shutdown TRUE
[ - 2005-09-05 11:24:08 CEST @] LOG:  next transaction ID: 166361; next 
OID: 55768251
[ - 2005-09-05 11:24:08 CEST @] LOG:  database system is ready

I'm not sure why I needed two shut down-calls at that time. It may have 
been due to a psql on a terminal I'd forgot to close first.

Best regards,

Arjen

In response to

Responses

pgsql-bugs by date

Next:From: Tom LaneDate: 2005-09-13 21:01:37
Subject: Re: Race-condition with failed block-write?
Previous:From: Tom LaneDate: 2005-09-13 18:53:11
Subject: Re: Race-condition with failed block-write?

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group