PANIC serves too many masters

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: PANIC serves too many masters
Date: 2023-11-18 22:29:11
Lists: pgsql-hackers


Right now we use PANIC for very different kinds of errors.

Sometimes for errors that are persistent, where crash-restarting and trying
again won't help:
(errmsg("could not locate a valid checkpoint record")));
(errmsg("online backup was canceled, recovery cannot continue")));

Sometimes for errors that could be transient, e.g. when running out of space
while trying to write out WAL:
errmsg("could not write to file \"%s\": %m", tmppath)));
errmsg("could not write to log file \"%s\" at offset %u, length %zu: %m",
xlogfname, startoffset, nleft)));
(the ERROR is often promoted to a PANIC due to critical sections).

Sometimes for "should never happen" checks that are important enough to check
in production builds:
elog(PANIC, "stuck spinlock detected at %s, %s:%d",
elog(PANIC, "failed to re-find shared proclock object");

I have two main issues with this:

1) If core dumps are allowed, we trigger core dumps for all of these. That's
good for "should never happen" type of errors like a stuck spinlock. But it
makes no sense for things like the on-disk state being wrong at startup, or
running out of space while writing WAL - if anything, dumping core when we're
out of disk space makes the situation worse, since the core file itself
consumes space.

It's very useful to be able to collect dumps for crashes in production, but
it's not useful to generate thousands of identical cores because crash
recovery fails with out-of-space and we retry over and over.

2) For errors where crash-restarting won't fix anything, using PANIC doesn't
allow postmaster to distinguish between an error that should lead
postmaster to exit itself (after killing other processes, obviously) and
the normal crash restart cycle.

I've been trying to do some fleet-wide analyses of the causes of crashes, but
having core dumps for lots of things that aren't real crashes, often repeated
many times, makes that much harder. Filtering out abort()s and just looking at
segfaults filters out far too much.

I don't quite know what we should do. But the current situation decidedly
doesn't seem great.

Maybe we could have:
- PANIC_BUG - triggers abort() followed by a crash restart cycle
to be used for things like a stuck spinlock
- PANIC_RETRY - causes a crash restart cycle, no core dump
to be used for things like ENOSPC during WAL writes
- PANIC_EXIT - causes postmaster to exit(1)
to be used for things where retrying won't help, like
"requested recovery stop point is before consistent recovery point"

One could argue that some of the PANICs that want to just shut down the server
should instead be FATALs, with knowledge in postmaster about which/when such
errors should trigger exiting. We do have something like this for the startup
process, but only when errors happen "early enough", and without being able to
distinguish between "retryable" and "should exit" type errors. But ISTM that
that requires adding more and more knowledge to postmaster.c, instead of
leaving it with the code that raises the error.


Andres Freund

