| From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
|---|---|
| To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
| Subject: | Startup process deadlock: WaitForProcSignalBarriers vs aux process |
| Date: | 2026-04-22 11:21:02 |
| Message-ID: | CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
Over in the Hackers Discord, Melany pointed out [0] a random failure
of tests on the master branch, which seemed to have nothing to do with
the commit they failed on.
The logs [1] indicate that the startup process was waiting for another
process to process a signal barrier. While there isn't enough
information available to conclusively pin the blame on any specific
component, I think I have a good understanding of what happened:
>> 2026-04-21 15:10:50.065 UTC startup[19246] LOG: still waiting for backend with PID 19244 to accept ProcSignalBarrier
Here, the startup process is waiting for the process with PID 19244 to
handle a signal barrier. The log does not say which process that is,
but we can deduce it:
In the startup sequence, the postmaster creates these child processes,
in short order:
1. checkpointer
2. bgwriter
3. startup
It is therefore likely that the startup process' PID is exactly two
larger than that of the checkpointer. The startup process here has
PID 19246, so PID 19244 is most likely the checkpointer.
# Which code in the Startup process is waiting?
I think it's this: The startup process logged that it started with a
clean shutdown, so no recovery code should be executed. This excludes
most possible call sites of WaitForProcSignalBarrier, except this
one: The startup process calls StartupXLOG ->
UpdateLogicalDecodingStatusEndOfRecovery(), which then calls
    if (IsUnderPostmaster)
        WaitForProcSignalBarrier(
            EmitProcSignalBarrier(
                PROCSIGNAL_BARRIER_UPDATE_XLOG_LOGICAL_INFO
            ));
# Why doesn't the Checkpointer process acknowledge the ProcSignalBarrier?
If the ProcSignalBarrier (PSB) is emitted (and signaled to the
checkpointer) before the checkpointer has registered its SIGUSR1
handler, then the checkpointer never gets the notice to check its
procsignal slot. It won't notice the updated procsignal flags and
won't process the PSB until it receives another SIGUSR1.
Signals are sent to every process whose procsignal pss_pid is set,
that is, every process that has called ProcSignalInit. For the
checkpointer (like other aux processes) that happens in
AuxiliaryProcessMainCommon. However, the checkpointer (again like
other aux processes) calls AuxiliaryProcessMainCommon before
registering its signal handlers, which creates a small window in time
where signals are sent but not handled.
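To make that window concrete, here is a minimal single-process sketch in C.
It is a toy model, not PostgreSQL code: every toy_* name is hypothetical,
and it assumes that a SIGUSR1 arriving before the handler is registered is
simply discarded (modelled here with SIG_IGN), which this mail does not
establish for the real checkpointer.

```c
#include <assert.h>
#include <signal.h>

/* Toy stand-ins for the shared procsignal flag and the receiver's state. */
static volatile sig_atomic_t toy_barrier_flag = 0;     /* set by the emitter */
static volatile sig_atomic_t toy_barrier_absorbed = 0; /* set by the receiver */

/* Toy SIGUSR1 handler: looks at the flag only when a signal is delivered. */
static void toy_sigusr1_handler(int signo)
{
    (void) signo;
    if (toy_barrier_flag)
    {
        toy_barrier_flag = 0;
        toy_barrier_absorbed = 1;
    }
}

/* Install a SIGUSR1 disposition and check that it worked. */
static void toy_set_disposition(void (*handler)(int))
{
    void (*prev)(int) = signal(SIGUSR1, handler);

    assert(prev != SIG_ERR);
    (void) prev;
}

/* Toy emitter: set the flag first, then signal the receiver once. */
static void toy_emit_barrier(void)
{
    toy_barrier_flag = 1;
    raise(SIGUSR1);
}

/*
 * Simulate receiver startup. The emitter fires right after the point where
 * the receiver would have published its slot. Returns 1 if the barrier was
 * absorbed on that first (and only) signal, 0 otherwise.
 */
int toy_first_barrier_absorbed(int install_handler_first)
{
    toy_barrier_flag = 0;
    toy_barrier_absorbed = 0;
    toy_set_disposition(SIG_IGN);   /* ASSUMPTION: early signals are dropped */

    if (install_handler_first)
        toy_set_disposition(toy_sigusr1_handler);

    /* slot is "published" here, so the emitter may signal us from now on */
    toy_emit_barrier();

    if (!install_handler_first)
        toy_set_disposition(toy_sigusr1_handler);   /* too late for that signal */

    return toy_barrier_absorbed;
}

/* A later, unrelated SIGUSR1 finally makes the receiver look at the flag. */
int toy_resend_signal(void)
{
    raise(SIGUSR1);
    return toy_barrier_absorbed;
}
```

In this model, toy_first_barrier_absorbed(1) returns 1, whereas
toy_first_barrier_absorbed(0) returns 0 and leaves the flag set. Only a
subsequent toy_resend_signal() lets the handler see it, which corresponds
to the "not until it receives a new SIGUSR1" case described above.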
# Is this new?
The issue of registering signal handlers only after opening the
process up to receiving signals has existed for a long time (unchanged
since at least 2022); only the ProcSignalBarrier in the startup
process is new: UpdateLogicalDecodingStatusEndOfRecovery was added
with Sawada-san's 67c20979.
# A solution?
I don't have one right now.
I was thinking in the direction of having a compile-time array of
signal handlers per aux process type, which is read by
AuxiliaryProcessMainCommon() to register the signal handlers ahead of
ProcSignalInit(), but I've not yet looked at the exact implications,
nor analyzed whether that's actually safe. It would move some
duplicative code patterns into compile-time structs, but that's not
necessarily a universal good.
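As a rough illustration of that direction only, the table-driven idea could
look something like the sketch below. Everything in it is hypothetical (the
Toy* types, the table layout, the toy_* functions); it does not call the
real AuxiliaryProcessMainCommon() or ProcSignalInit(), and it says nothing
about whether the reordering is actually safe in PostgreSQL.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical per-signal entry: which handler to install for which signal. */
typedef struct ToySignalSpec
{
    int   signo;
    void (*handler)(int);
} ToySignalSpec;

/* Hypothetical aux process types; not PostgreSQL's real enum. */
typedef enum ToyAuxType
{
    TOY_CHECKPOINTER,
    TOY_BGWRITER,
    TOY_NUM_AUX_TYPES
} ToyAuxType;

static volatile sig_atomic_t toy_usr1_seen = 0;
static volatile sig_atomic_t toy_usr2_seen = 0;
static volatile sig_atomic_t toy_slot_published = 0;

static void toy_on_usr1(int signo) { (void) signo; toy_usr1_seen = 1; }
static void toy_on_usr2(int signo) { (void) signo; toy_usr2_seen = 1; }

#define TOY_MAX_SPECS 4

/* Compile-time table; each row ends with a NULL-handler terminator. */
static const ToySignalSpec toy_aux_signals[TOY_NUM_AUX_TYPES][TOY_MAX_SPECS] = {
    [TOY_CHECKPOINTER] = { {SIGUSR1, toy_on_usr1}, {SIGUSR2, toy_on_usr2}, {0, NULL} },
    [TOY_BGWRITER]     = { {SIGUSR1, toy_on_usr1}, {0, NULL} },
};

/*
 * Toy common startup: install every handler from the table FIRST, and only
 * then "publish" the slot (the stand-in for making the process signalable).
 * Returns the number of handlers installed, or -1 on failure.
 */
int toy_aux_main_common(ToyAuxType type)
{
    int t = (int) type;
    int installed = 0;

    assert(t >= 0 && t < TOY_NUM_AUX_TYPES);

    for (const ToySignalSpec *spec = toy_aux_signals[t]; spec->handler != NULL; spec++)
    {
        if (signal(spec->signo, spec->handler) == SIG_ERR)
            return -1;
        installed++;
    }

    toy_slot_published = 1;     /* from here on, others may signal us */
    return installed;
}
```

With this ordering, a SIGUSR1 that arrives any time after the slot is
published already finds a handler in place. The cost is that the per-process
handler choices move out of each process's main function and into one shared
table.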
Kind regards,
Matthias van de Meent
[0] https://discord.com/channels/1258108670710124574/1346208113132568646/1496179622591598592
[1] https://api.cirrus-ci.com/v1/artifact/task/6239099197063168/log/contrib/auto_explain/log/postmaster.log