WAL recovery is broken by FSM patch

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: WAL recovery is broken by FSM patch
Date: 2008-09-30 22:52:15
Message-ID: 27934.1222815135@sss.pgh.pa.us
Lists: pgsql-hackers

I just managed to make a backend dump core while fooling with the CTE
patch, and found out that the system failed to recover, because the
ensuing startup process *also* dumped core. Here's the backtrace:

Core was generated by `postgres: startup'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000048df59 in XLogInsert (rmid=2 '\002', info=32 ' ',
    rdata=0x7fff41713550) at xlog.c:813
813 record->xl_prev = Insert->PrevRecord;
(gdb) bt
#0 0x000000000048df59 in XLogInsert (rmid=2 '\002', info=32 ' ',
    rdata=0x7fff41713550) at xlog.c:813
#1 0x00000000005ec8d0 in smgrtruncate (reln=0x206a148, forknum=FSM_FORKNUM,
    nblocks=3, isTemp=0 '\0') at smgr.c:594
#2 0x00000000005dc194 in FreeSpaceMapTruncateRel (rel=0x2072050, nblocks=15)
    at freespace.c:275
#3 0x00000000005dc2ee in fsm_redo (lsn=<value optimized out>,
    record=<value optimized out>) at freespace.c:779
#4 0x000000000049003f in StartupXLOG () at xlog.c:5146
#5 0x00000000004a9cd8 in AuxiliaryProcessMain (argc=2, argv=0x7fff41713790)
    at bootstrap.c:420
#6 0x00000000005bd24d in StartChildProcess (type=StartupProcess)
    at postmaster.c:4074
#7 0x00000000005c053f in PostmasterStateMachine () at postmaster.c:2737
#8 0x00000000005c0965 in reaper (postgres_signal_arg=<value optimized out>)
    at postmaster.c:2325
#9 <signal handler called>
#10 0x0000003f71edcbb3 in __select_nocancel () from /lib64/libc.so.6
#11 0x00000000006ce41a in pg_usleep (microsec=<value optimized out>)
    at pgsleep.c:43
#12 0x00000000005bed05 in ServerLoop () at postmaster.c:1232
#13 0x00000000005bf99a in PostmasterMain (argc=3, argv=0x203a890)
    at postmaster.c:1031
#14 0x0000000000568fd8 in main (argc=3, argv=0x203a890) at main.c:188

We should of course not be attempting XLogInsert during WAL replay.
Now smgr_redo knows about that. I rather wonder why fsm_redo is
attempting to call smgrtruncate at all, seeing that there's presumably
smgr's own redo record to tell it to deal with that. I think that all
fsm_redo need do is clear out the last untruncated block of FSM.
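
To make the suggestion above concrete, here is a toy model in C. It is not PostgreSQL code: every name in it (toy_in_recovery, toy_xlog_insert, toy_truncate, toy_fsm_redo_truncate, TOY_SLOTS_PER_PAGE) is hypothetical, and the recovery flag is only one assumed way of keeping a truncate routine from writing WAL during replay. The point is just the shape: the map's redo routine clears the tail of the last surviving page and never calls the truncate path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model; none of these names are PostgreSQL APIs. */
#define TOY_SLOTS_PER_PAGE 8

static bool toy_in_recovery = false; /* stands in for "we are replaying WAL" */
static int  toy_wal_records = 0;     /* records written by toy_xlog_insert */

/* Stand-in for a WAL insert: it must never run during replay. */
static void toy_xlog_insert(void)
{
    assert(!toy_in_recovery);
    toy_wal_records++;
}

/*
 * Truncate that logs its own action, but only outside recovery.
 * (Assumption: a flag check is used here purely for illustration.)
 */
static void toy_truncate(size_t *nblocks, size_t new_nblocks)
{
    if (!toy_in_recovery)
        toy_xlog_insert();
    if (new_nblocks < *nblocks)
        *nblocks = new_nblocks;
}

/*
 * Redo routine for the map: zero the slots from first_removed_slot to the
 * end of the last kept page. It neither truncates nor writes WAL; the
 * truncation itself is assumed to be replayed from its own record.
 */
static void toy_fsm_redo_truncate(unsigned char *last_page,
                                  size_t first_removed_slot)
{
    if (first_removed_slot < TOY_SLOTS_PER_PAGE)
        memset(last_page + first_removed_slot, 0,
               TOY_SLOTS_PER_PAGE - first_removed_slot);
}
```

In this sketch, calling toy_fsm_redo_truncate with toy_in_recovery set leaves toy_wal_records unchanged, whereas a redo routine that reached toy_xlog_insert would trip the assertion, which is the toy analogue of the crash in the backtrace.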

regards, tom lane
