Re: UTC4115FATAL: the database system is in recovery mode

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Mathew Samuel <Mathew(dot)Samuel(at)entrust(dot)com>
Cc: "'pgsql-general(at)postgresql(dot)org'" <pgsql-general(at)postgresql(dot)org>
Subject: Re: UTC4115FATAL: the database system is in recovery mode
Date: 2011-05-30 23:51:44
Message-ID: 4DE42D90.6070201@postnewspapers.com.au
Lists: pgsql-general

On 05/30/2011 10:29 PM, Mathew Samuel wrote:

> 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too frequently
> (10 seconds apart)
> 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement
> timeout
> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1
> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275,
> it was already committed
> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated
> by signal 6

Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by
statement_timeout, couldn't be aborted (assuming that transaction was in
fact 1827110275), and then the backend crashed with signal 6 (SIGABRT).
SIGABRT can be caused by an assertion failure, by certain fatal aborts in
the C library triggered by memory allocation errors, etc.

Alas, while PostgreSQL may have dumped a core file, I doubt there's any
debug information in your build. If you do find a core file for that
process ID, it might be worth checking for a debuginfo rpm just in case.
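
If a core did turn up, something along these lines would show whether a
useful backtrace is recoverable. The paths below are only examples (where
the postgres binary and the core file land depends on your packaging and
your kernel's core_pattern settings), and on a Red Hat-ish system the
debug symbols would come from yum-utils' debuginfo-install:

$ debuginfo-install postgresql-server     # pulls in debug symbols, if packaged
$ gdb /usr/bin/postgres /path/to/core.3932
(gdb) bt full                             # full backtrace of the crashed backend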

> In fact those last 3 lines are repeated over and over again until
> "UTC4115FATAL: the database system is in recovery mode" is logged for
> 4 hours. At some point, 4 hours later of course, it appears that the
> system recovers.

Wow. Four hours of recovery with default checkpoint settings.
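
For what it's worth, the hint in your log is the standard advice once
things are stable again: raising checkpoint_segments spaces checkpoints
out and silences the "occurring too frequently" warnings, at the cost of
keeping more WAL around. Something along these lines in postgresql.conf,
with the values purely illustrative:

checkpoint_segments = 16            # default is 3; each segment is 16MB of WAL
checkpoint_completion_target = 0.7  # default is 0.5; spreads checkpoint I/O out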

Is it possible that the server was completely overloaded and swapping
heavily? That could explain why the VACUUM timed out in the first place,
and would also explain why it took such a long time to recover. Check
your system logs around the same time for other indications of excessive
load, and check your monitoring history if you have something like Cacti
active.
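
If sysstat happens to be collecting data on that machine and the history
from that day is still retained, sar can show the load and swap activity
around the incident, e.g. (the sa28 file name here just assumes the usual
/var/log/sa/saDD naming for the 28th):

$ sar -q -f /var/log/sa/sa28    # run queue length and load averages
$ sar -W -f /var/log/sa/sa28    # pages swapped in/out per second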

See if there's anything interesting in the kernel logs too.
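
An OOM-killer strike or other memory-pressure messages would show up
there. Exactly where syslog puts the kernel log varies by distro; on most
Red Hat-ish boxes something like this finds it:

$ grep -iE 'oom|out of memory' /var/log/messages*
$ dmesg | tail -50              # only useful if the box hasn't rebooted since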

Just for completeness, can you send all non-commented-out, non-blank
lines in your postgresql.conf ?

$ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1

--
Craig Ringer
