Re: Race-condition with failed block-write?

From: Arjen van der Meijden <acm(at)tweakers(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Race-condition with failed block-write?
Date: 2005-09-13 20:12:18
Message-ID: 432732A2.9060300@tweakers.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 13-9-2005 20:53, Tom Lane wrote:
> Arjen van der Meijden <acm(at)tweakers(dot)net> writes:
>
>>Unless reindexing is part of other commands, I didn't do that. The last
>>time 'grep' was able to find an reference to something being reindexed
>>was in June, something (maybe me, but I doubt it, I'd also reindex the
>>user-tables, I suppose) was reindexing all system tables back then.
>
>
> Well, that would have done it. But if you have such complete logs, do
> you have the postmaster log from the whole period?
I have logs from the time I started using it seriously indeed. It was
just never rotated or something. I did adjust it to have proper prefixes
only at 23th of august, so everything before that is a bit less easy to
read.
Besides that it contains lots of "debug" stuff, like logged queries and
their statistics.

It is a lot of text and the file is currently 52MB. But without all
those repeated failure-messages its about 862KB. So I can e-mail a
compressed version to you, if you're interested in it.

> Look for sequences
> approximately like this:
>
> and note whether there's any one that shows a WAL end address (here
> 0/12B65A88) that's less than the one before.

I take it you mean something like this:
[ - 2005-08-23 15:29:26 CEST @] LOG: checkpoint record is at 29/2E73E410
[ - 2005-08-23 15:37:27 CEST @] LOG: checkpoint record is at 29/2E73E410
[ - 2005-08-25 10:26:27 CEST @] LOG: checkpoint record is at 29/2E8D26B4
[ - 2005-08-25 10:47:59 CEST @] LOG: checkpoint record is at 29/2E8D26F0
[ - 2005-09-05 11:24:08 CEST @] LOG: checkpoint record is at 29/2E73E44C
[ - 2005-09-09 10:27:20 CEST @] LOG: checkpoint record is at 29/2E73E44C

The last start-up at 25th of August has a higher second value than the
start-up at the 5th of September. Which was a start since it was closed
at sep 1.

>>l /var/lib/postgresql/data/pg_xlog/
>>total 145M
>>drwx------ 3 postgres postgres 4.0K Sep 1 12:37 .
>>drwx------ 8 postgres postgres 4.0K Sep 13 20:31 ..
>>-rw------- 1 postgres postgres 16M Sep 13 19:25 00000001000000290000002E
>>-rw------- 1 postgres postgres 16M Sep 1 12:36 000000010000002900000067
>>-rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000068
>>-rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000069
>>-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006A
>>-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006B
>>-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006C
>>-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006D
>>-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006E
>
>
> This is definitely ugly: there's no way all those higher-numbered
> segments would be in place if the WAL counter hadn't been up to that
> range. So I think we are definitely looking at a WAL-counter-went-
> backward problem, not some random data corruption.
>
> Judging from the file dates, the WAL counter was still OK (ie, above
> 67xxx) on Sep 1, and the problem must have occurred since then.

That's odd. But it must have been my hard attempts to shut it down then.
These are all the lines there are for Sep 1:

[ - 2005-09-01 12:36:50 CEST @] LOG: received fast shutdown request
[ - 2005-09-01 12:36:50 CEST @] LOG: shutting down
[ - 2005-09-01 12:36:50 CEST @] LOG: database system is shut down
[ - 2005-09-01 12:37:01 CEST @] LOG: received smart shutdown request
[ - 2005-09-01 12:37:01 CEST @] LOG: shutting down
[ - 2005-09-01 12:37:01 CEST @] LOG: database system is shut down
[ - 2005-09-05 11:24:08 CEST @] LOG: database system was shut down at
2005-09-01 12:37:01 CEST
[ - 2005-09-05 11:24:08 CEST @] LOG: checkpoint record is at 29/2E73E44C
[ - 2005-09-05 11:24:08 CEST @] LOG: redo record is at 29/2E73E44C;
undo record is at 0/0; shutdown TRUE
[ - 2005-09-05 11:24:08 CEST @] LOG: next transaction ID: 166361; next
OID: 55768251
[ - 2005-09-05 11:24:08 CEST @] LOG: database system is ready

I'm not sure why I needed two shut down-calls at that time. It may have
been due to a psql on a terminal I'd forgot to close first.

Best regards,

Arjen

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2005-09-13 21:01:37 Re: Race-condition with failed block-write?
Previous Message Tom Lane 2005-09-13 18:53:11 Re: Race-condition with failed block-write?