Re: UTC4115FATAL: the database system is in recovery mode

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Mathew Samuel <Mathew(dot)Samuel(at)entrust(dot)com>
Cc: "'pgsql-general(at)postgresql(dot)org'" <pgsql-general(at)postgresql(dot)org>
Subject: Re: UTC4115FATAL: the database system is in recovery mode
Date: 2011-05-30 23:51:44
Message-ID: 4DE42D90.6070201@postnewspapers.com.au
Lists: pgsql-general
On 05/30/2011 10:29 PM, Mathew Samuel wrote:

> 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too frequently
> (10 seconds apart)
> 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement
> timeout
> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1
> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275,
> it was already committed
> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated
> by signal 6

Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by 
statement_timeout, couldn't be aborted (assuming its transaction was in 
fact 1827110275), and then the backend crashed with signal 6 (SIGABRT). 
SIGABRT can be caused by an assertion failure, by certain fatal aborts 
in the C library triggered by memory allocation errors, and so on.

Alas, while PostgreSQL may have dumped a core file, I doubt there's any 
debug information in your build. If you do find a core file for that 
process ID, it might be worth checking for a debuginfo rpm just in case.
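
If one does turn up, a backtrace is usually the first thing worth 
grabbing. Something along these lines, assuming a typical RPM layout 
(the paths, package name and core file name below are guesses, adjust 
them to your install):

$ ls /var/lib/pgsql/data/core*        # cores usually land in the data directory
$ yum install postgresql-debuginfo    # exact debuginfo package name varies by distro/version
$ gdb /usr/bin/postgres /var/lib/pgsql/data/core.3932
(gdb) bt                              # backtrace of the crashed backend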

> In fact those last 3 lines are repeated over and over again repeatedly
> until "UTC4115FATAL: the database system is in recovery mode" is logged
> for 4 hours. At some point, 4 hours later of course, it appears that the
> system recovers.

Wow. Four hours recovery with default checkpoint settings.
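
As the HINT in your log says, raising checkpoint_segments would at 
least stop checkpoints firing every ten seconds. Purely as an 
illustration, not tuned for your workload, something like this in 
postgresql.conf:

checkpoint_segments = 16              # default is 3; costs more WAL disk space
checkpoint_completion_target = 0.7    # spread checkpoint I/O over more of the interval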

Is it possible that the server was completely overloaded and swapping 
heavily? That would explain why VACUUM timed out in the first place, 
and why it took such a long time to recover. Check your system logs 
around the same time for other indications of excessive load, and check 
your monitoring history if you have something like Cacti active.
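
If sysstat happens to be collecting on that box, its history will show 
whether it was swapping at the time. For example, assuming a stock 
Linux sysstat install and the 28th's data file:

$ sar -r -f /var/log/sa/sa28    # memory usage on the 28th
$ sar -W -f /var/log/sa/sa28    # pages swapped in/out per second
$ sar -q -f /var/log/sa/sa28    # run queue and load average history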

See if there's anything interesting in the kernel logs too.
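
Something like the following is a quick first pass; the log file 
location varies by distro (some use /var/log/syslog instead):

$ dmesg | grep -i -e oom -e 'out of memory'
$ grep -i -e oom -e 'blocked for more than' /var/log/messages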

Just for completeness, can you send all non-commented-out, non-blank 
lines in your postgresql.conf?

$ egrep '^[^#[:space:]]' postgresql.conf | cut -d '#' -f 1

--
Craig Ringer
