Postmaster's handling of startup-process crash is busted

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Postmaster's handling of startup-process crash is busted
Date: 2015-07-08 20:53:48
Message-ID: 13743.1436388828@sss.pgh.pa.us
Lists: pgsql-hackers

My Salesforce colleagues observed a failure mode in which a bug in the
crash recovery logic caused the startup process to get a SEGV while trying
to recover after a backend crash. The postmaster should have given up at
that point, but instead it kept on respawning the startup process, which
of course just kept on crashing. The cause seems to be a bug in my old
commit 442231d7f71764b8: if FatalError has been set, we suppose that the
startup process died because we SIGQUIT'd it, which is simply wrong in
this case.

AFAICS the only way to fix this properly is to explicitly track whether we
sent the startup process a kill signal. I started out with a separate
boolean, but after a while decided that it'd be better to invent an enum
representing the startup process state, which could also subsume the
existing but rather ad-hoc flag RecoveryError. So that led me to the
attached patch.
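
To make the idea concrete, here is a minimal standalone sketch of the
difference between the two approaches. It is not the attached patch and
not actual postmaster code: the type and function names (StartupState,
PostmasterAction, old_decide, new_decide) are hypothetical, and the
postmaster's real exit handling is reduced to three outcomes. The old
logic infers "we killed it" from FatalError alone; the enum version
records whether a kill signal was actually sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; not taken from the attached patch. */
typedef enum
{
    STARTUP_NOT_RUNNING,    /* no startup process exists */
    STARTUP_RUNNING,        /* launched, and we have not signaled it */
    STARTUP_SIGNALED,       /* we sent it a kill signal ourselves */
    STARTUP_CRASHED         /* it died without our having signaled it */
} StartupState;

typedef enum
{
    ACTION_CONTINUE_STARTUP,    /* clean exit: carry on to normal operation */
    ACTION_RESTART_RECOVERY,    /* expected death: reinitialize and respawn */
    ACTION_GIVE_UP              /* unexpected crash: postmaster should exit */
} PostmasterAction;

/*
 * Old inference: if FatalError is set, assume the startup process died
 * because we SIGQUIT'd it. That is wrong when it crashed on its own
 * while recovering from an earlier backend crash.
 */
static PostmasterAction
old_decide(bool fatal_error, bool exited_cleanly)
{
    if (exited_cleanly)
        return ACTION_CONTINUE_STARTUP;
    if (fatal_error)
        return ACTION_RESTART_RECOVERY;     /* respawns forever on a SEGV */
    return ACTION_GIVE_UP;
}

/*
 * Explicit tracking: the caller sets *state to STARTUP_SIGNALED at the
 * moment it sends the kill signal, so an abnormal exit in any other
 * state is known to be a genuine crash.
 */
static PostmasterAction
new_decide(StartupState *state, bool exited_cleanly)
{
    assert(state != NULL);

    if (exited_cleanly)
    {
        *state = STARTUP_NOT_RUNNING;
        return ACTION_CONTINUE_STARTUP;
    }
    if (*state == STARTUP_SIGNALED)
    {
        *state = STARTUP_NOT_RUNNING;
        return ACTION_RESTART_RECOVERY;
    }
    *state = STARTUP_CRASHED;               /* replaces an ad-hoc error flag */
    return ACTION_GIVE_UP;
}
```

In the failure described above, FatalError is already set from the
original backend crash, so old_decide(true, false) asks for another
restart, whereas new_decide with the state still at STARTUP_RUNNING
gives up. The STARTUP_CRASHED value is where a flag like RecoveryError
could be folded in.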

Any thoughts/objections?

regards, tom lane

Attachment: fix-startup-process-crash-handling.patch (text/x-diff, 6.3 KB)
