Re: "PANIC: cannot make new WAL entries during recovery" in the wild

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: "PANIC: cannot make new WAL entries during recovery" in the wild
Date: 2009-08-07 17:51:53
Message-ID: 24668.1249667513@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Today we got a report in the Spanish list about the message in $subject.
> The server is 8.4 running on Windows.

I accidentally managed to reproduce this in HEAD just now, by kill -9'ing
a backend that was in the midst of a COPY IN operation (I was trying to
reproduce Neil Best's unrelated issue...) The server log is

LOG: server process (PID 23846) was terminated by signal 9
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2009-08-07 11:27:36 EDT
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/1B9D7790
LOG: unexpected pageaddr 0/1532E000 in log file 0, segment 28, offset 3334144
LOG: redo done at 0/1C32D200
PANIC: cannot make new WAL entries during recovery
LOG: startup process (PID 23883) was terminated by signal 6
LOG: aborting startup due to startup process failure

and the stack trace of the panic'd startup process looks like

#4 0x4b6e20 in errfinish (dummy=1) at elog.c:503
#5 0x4b86a0 in elog_finish (elevel=1073803952, fmt=0x7b0394b0 "") at elog.c:1142
#6 0x1f722c in XLogInsert (rmid=11 '\013', info=114 'r', rdata=0xc004d07c) at xlog.c:555
#7 0x1df290 in _bt_insertonpg (rel=0x4006cf28, buf=70, stack=0x3, itup=0x4006d150, newitemoff=38,
    split_only_page=0) at nbtinsert.c:833
#8 0x1e0898 in _bt_insert_parent (rel=0x4006cf28, buf=304, rbuf=854, stack=0x7b03b9d8, is_root=0, is_only=0)
    at nbtinsert.c:1627
#9 0x1ef098 in btree_xlog_cleanup () at nbtxlog.c:927
#10 0x201c44 in StartupXLOG () at xlog.c:5767
#11 0x206134 in StartupProcessMain () at xlog.c:8034
#12 0x228d0c in AuxiliaryProcessMain (argc=2, argv=0x7b03b6d8) at bootstrap.c:433
#13 0x39bb68 in StartChildProcess (type=StartupProcess) at postmaster.c:4243

So that confirms my speculation that btree index cleanup is the source
of the message. We have two basic approaches to dealing with it:

1. Decide that the check added to XLogInsert is wrong and take it out.

2. Arrange for some sort of explicit state transition between the
WAL-reading and cleanup phases of recovery, and make sure XLogInsert
knows about it.
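
A rough sketch of what option 2 might look like, purely as an illustration. All names here (RecoveryPhase, recovery_phase, sketch_xlog_insert, sketch_end_of_replay) are hypothetical and are not taken from the PostgreSQL source; the real XLogInsert would PANIC where this sketch simply returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical recovery phases; not PostgreSQL's actual state names. */
typedef enum
{
    PHASE_WAL_REPLAY,   /* reading and applying existing WAL records */
    PHASE_CLEANUP,      /* rmgr cleanup, e.g. finishing incomplete btree splits */
    PHASE_NORMAL        /* ordinary operation */
} RecoveryPhase;

static RecoveryPhase recovery_phase = PHASE_WAL_REPLAY;
static int  wal_records_written = 0;

/* Explicit transition made once redo is done, before any cleanup runs. */
static void
sketch_end_of_replay(void)
{
    assert(recovery_phase == PHASE_WAL_REPLAY);
    recovery_phase = PHASE_CLEANUP;
}

/*
 * Stand-in for the check in XLogInsert: refuse new WAL only while we are
 * still replaying, and allow it during cleanup and normal running.
 * Returns true if the (imaginary) record was accepted.
 */
static bool
sketch_xlog_insert(void)
{
    if (recovery_phase == PHASE_WAL_REPLAY)
        return false;       /* the real code would elog(PANIC, ...) here */
    wal_records_written++;
    return true;
}
```

With something of this shape, the startup code would call sketch_end_of_replay() after "redo done" and before invoking the cleanup routines, so an insert issued from the btree cleanup path would be accepted while one issued in the middle of replay would still be rejected.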

Thoughts?

regards, tom lane
