Postmaster doesn't correctly handle crashes in PM_STARTUP state

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Postmaster doesn't correctly handle crashes in PM_STARTUP state
Date: 2023-07-29 21:51:24
Message-ID: 20230729215124.ra4rbwck5dlawvmo@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

While testing something I made the checkpointer process intentionally crash as
soon as it started up. The odd thing I observed on macOS is that we start a
*new* checkpointer before shutting down:

2023-07-29 14:32:39.241 PDT [65031] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-07-29 14:32:39.244 PDT [65031] DEBUG: reaping dead processes
2023-07-29 14:32:39.244 PDT [65031] LOG: checkpointer process (PID 65032) was terminated by signal 11: Segmentation fault: 11
2023-07-29 14:32:39.244 PDT [65031] LOG: terminating any other active server processes
2023-07-29 14:32:39.244 PDT [65031] DEBUG: sending SIGQUIT to process 65034
2023-07-29 14:32:39.245 PDT [65031] DEBUG: sending SIGQUIT to process 65033
2023-07-29 14:32:39.245 PDT [65031] DEBUG: reaping dead processes
2023-07-29 14:32:39.245 PDT [65035] LOG: process 65035 taking over ProcSignal slot 126, but it's not empty
2023-07-29 14:32:39.245 PDT [65031] DEBUG: reaping dead processes
2023-07-29 14:32:39.245 PDT [65031] LOG: shutting down because restart_after_crash is off

Note that a new process (65035) is started after the crash has been
observed. I added logging to StartChildProcess(), and the process that's
started is another checkpointer.

I could not initially reproduce this on Linux.

After a fair bit of confusion, I figured out the reason: On macOS it takes a
bit longer for the startup process to finish, which means we're still in
PM_STARTUP state when we see that crash, instead of PM_RECOVERY or PM_RUN or
...

The problem is that unfortunately HandleChildCrash() doesn't change pmState
when in PM_STARTUP:

    /* We now transit into a state of waiting for children to die */
    if (pmState == PM_RECOVERY ||
        pmState == PM_HOT_STANDBY ||
        pmState == PM_RUN ||
        pmState == PM_STOP_BACKENDS ||
        pmState == PM_SHUTDOWN)
        pmState = PM_WAIT_BACKENDS;

Once I figured that out, I put a sleep(1) in StartupProcessMain(), and the
problem reproduces on Linux as well.
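
To make the gap concrete, here is a minimal standalone sketch of the state
check quoted above. It is a toy model, not code from postmaster.c: the enum
lists only the states named in this mail, and state_after_crash() and its
also_handle_startup flag are hypothetical names made up for illustration.
Whether adding PM_STARTUP to the condition is the right fix is an assumption,
not something established here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy subset of postmaster states: only the ones named in this mail. */
typedef enum
{
    PM_STARTUP,
    PM_RECOVERY,
    PM_HOT_STANDBY,
    PM_RUN,
    PM_STOP_BACKENDS,
    PM_WAIT_BACKENDS,
    PM_SHUTDOWN,
    PM_SHUTDOWN_2
} PMState;

/*
 * Hypothetical helper: return the state the toy postmaster is in after a
 * child crash.  With also_handle_startup = false this mirrors the condition
 * quoted from HandleChildCrash(); with true it additionally treats
 * PM_STARTUP as a state to leave (an assumed variation, for comparison).
 */
PMState
state_after_crash(PMState state, bool also_handle_startup)
{
    assert(state <= PM_SHUTDOWN_2);

    if (state == PM_RECOVERY ||
        state == PM_HOT_STANDBY ||
        state == PM_RUN ||
        state == PM_STOP_BACKENDS ||
        state == PM_SHUTDOWN)
        return PM_WAIT_BACKENDS;

    if (also_handle_startup && state == PM_STARTUP)
        return PM_WAIT_BACKENDS;

    /* Everything else, including PM_STARTUP by default, is left as-is. */
    return state;
}
```

Calling state_after_crash(PM_STARTUP, false) returns PM_STARTUP unchanged,
which matches the observed behavior: the state never moves to
PM_WAIT_BACKENDS, so nothing stops another checkpointer from being started.
PM_SHUTDOWN_2 is likewise left alone in both variants.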

I haven't fully dug through the history, but this looks to be quite an old
problem.

Arguably we might also be missing PM_SHUTDOWN_2, but I can't really see a bad
consequence of that.

Greetings,

Andres Freund
