Re: UTC4115FATAL: the database system is in recovery mode

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Mathew Samuel <Mathew(dot)Samuel(at)entrust(dot)com>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: UTC4115FATAL: the database system is in recovery mode
Date: 2011-06-02 17:42:58
Message-ID: BANLkTindfxAEGUt+kdqGFBAtRVuoVikwNg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Fri, May 27, 2011 at 2:26 PM, Mathew Samuel
<Mathew(dot)Samuel(at)entrust(dot)com> wrote:
> I see the following error as found in pg.log:
> UTC4115FATAL:  the database system is in recovery mode
>
> Actually that message was logged repeatedly for about 4 hours according to
> the logs (I don't have access to the system itself, just the logs).
>
> Leading up to that error were the following in pg.log:
> 2011-03-28 10:44:06 UTC3609LOG:  checkpoints are occurring too frequently
> (11 seconds apart)
> 2011-03-28 10:44:06 UTC3609HINT:  Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:18 UTC3609LOG:  checkpoints are occurring too frequently
> (12 seconds apart)
> 2011-03-28 10:44:18 UTC3609HINT:  Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:28 UTC3609LOG:  checkpoints are occurring too frequently
> (10 seconds apart)
> 2011-03-28 10:44:28 UTC3609HINT:  Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:38 UTC3609LOG:  checkpoints are occurring too frequently
> (10 seconds apart)
> 2011-03-28 10:44:38 UTC3609HINT:  Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:42 UTC3932ERROR:  canceling statement due to statement
> timeout
> 2011-03-28 10:44:42 UTC3932STATEMENT:  vacuum full analyze _zamboni.sl_log_1
> 2011-03-28 10:44:42 UTC3932PANIC:  cannot abort transaction 1827110275, it
> was already committed
> 2011-03-28 10:44:42 UTC3566LOG:  server process (PID 3932) was terminated by
> signal 6
> 2011-03-28 10:44:42 UTC3566LOG:  terminating any other active server
> processes
> 2011-03-28 10:44:42 UTC13142WARNING:  terminating connection because of
> crash of another server process
> 2011-03-28 10:44:42 UTC13142DETAIL:  The postmaster has commanded this
> server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2011-03-28 10:44:42 UTC13142HINT:  In a moment you should be able to
> reconnect to the database and repeat your command.
> 2011-03-28 10:44:42 UTC29834WARNING:  terminating connection because of
> crash of another server process
> 2011-03-28 10:44:42 UTC29834DETAIL:  The postmaster has commanded this
> server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2011-03-28 10:44:42 UTC29834HINT:  In a moment you should be able to
> reconnect to the database and repeat your command.
> 2011-03-28 10:44:42 UTC3553WARNING:  terminating connection because of crash
> of another server process
> 2011-03-28 10:44:42 UTC3553DETAIL:  The postmaster has commanded this server
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> In fact those last 3 lines are repeated over and over again repeatedly until
> "UTC4115FATAL:  the database system is in recovery mode" is logged for 4
> hours. At some point, 4 hours later of course, it appears that the system
> recovers.
>
> The Checkpoints Settings in postgresql.conf are commented out so I guess the
> defaults are being used:
> #checkpoint_segments = 3                # in logfile segments, min 1, 16MB
> each
> #checkpoint_timeout = 5min              # range 30s-1h
> #checkpoint_warning = 30s               # 0 is off
> That system where this was seen was using pgsql-8.2.6 on RHEL4.
>
> Not sure if this is a known bug (or if it is a bug at all or something I can
> address using different configuration) but I thought I would post here first
> if any one might be familiar with this issue and suggest a possible
> solution. Any ideas?

Not really, although I've seen a few other crashes attributable to
trying to cancel VACUUM FULL. As of PostgreSQL 9.0, VACUUM FULL has
been completely rewritten, so whatever bug is causing this problem is
almost certainly gone in more recent versions. It would be nice to
track done and fix whatever is wrong in the older branches, but we'd
probably need a reproducible test case of some sort. I don't think it
craps out like that EVERY time you interrupt it...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Alexey Klyukin 2011-06-02 17:49:59 Re: BUG #6048: TRUNCATE vs TRUNCATE CASCADE: misleading message
Previous Message Robert Haas 2011-06-02 17:39:41 Re: BUG #5926: information schema dtd_identifier for element_types, columns, parameters views inconsistent