Startup process deadlock: WaitForProcSignalBarriers vs aux process

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Startup process deadlock: WaitForProcSignalBarriers vs aux process
Date: 2026-04-22 11:21:02
Message-ID: CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com
Lists: pgsql-hackers

Hi,

Over in the Hackers Discord, Melany pointed out [0] a random failure
of tests on the master branch, which seemed to have nothing to do with
the commit they failed on.

The logs [1] indicate that the startup process was waiting for another
process to process a signal barrier. While there isn't enough
information available to conclusively point the blame on any specific
component, I think I have a good understanding of what happened:

>> 2026-04-21 15:10:50.065 UTC startup[19246] LOG: still waiting for backend with PID 19244 to accept ProcSignalBarrier

Here, the startup process is waiting for process with PID 19244 to
handle a signal barrier. It is not entirely clear which process it's
waiting on, but we can deduce this:

In the startup sequence, the postmaster creates these child processes,
in short order:
1. checkpointer
2. bgwriter
3. startup

It is therefore likely that the startup process's PID is exactly two
larger than the checkpointer's. The log line above shows startup at
PID 19246 waiting on PID 19244, so the process it is waiting for is
most likely the checkpointer.

# Which code in the Startup process is waiting?

I think it's the following. The startup process logged that it started
from a clean shutdown, so no recovery code should have been executed.
That excludes most possible call sites of WaitForProcSignalBarrier,
except one: the startup process calls StartupXLOG ->
UpdateLogicalDecodingStatusEndOfRecovery(), which then calls

    if (IsUnderPostmaster)
        WaitForProcSignalBarrier(
            EmitProcSignalBarrier(
                PROCSIGNAL_BARRIER_UPDATE_XLOG_LOGICAL_INFO
            ));
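
As a rough illustration of why one unresponsive process stalls the whole wait, here is a toy model of an emit/wait pair built on a generation counter. It is only a sketch: the ToySlot struct, the toy_* function names, and the single polling pass are hypothetical simplifications, not the actual procsignal.c code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified slot: one per process that wants signals. */
typedef struct
{
    int      pid;                  /* 0 means the slot is unused */
    uint64_t absorbed_generation;  /* last barrier generation handled */
} ToySlot;

static uint64_t toy_global_generation = 0;

/* Emitter: bump the global generation and hand it back to the caller. */
uint64_t toy_emit_generation(void)
{
    return ++toy_global_generation;
}

/* Receiver: called once the process has reacted to the barrier. */
void toy_absorb(ToySlot *slot)
{
    slot->absorbed_generation = toy_global_generation;
}

/*
 * One polling pass of the waiter: return the PID of the first in-use slot
 * that has not yet caught up to 'generation', or 0 if every slot has.
 */
int toy_first_laggard(const ToySlot *slots, size_t nslots, uint64_t generation)
{
    size_t i;

    for (i = 0; i < nslots; i++)
    {
        if (slots[i].pid != 0 && slots[i].absorbed_generation < generation)
            return slots[i].pid;
    }
    return 0;
}
```

In this model a waiter would loop on toy_first_laggard() and could log the returned PID on each timeout, which matches the shape of the "still waiting for backend with PID ..." message: the wait cannot finish until that one slot advances.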

# Why doesn't the Checkpointer process acknowledge the ProcSignalBarrier?

If the PSB is emitted (and signaled to the checkpointer) before the
checkpointer has registered its SIGUSR1 handler, then the checkpointer
never receives the notice to check its procsignal slots. It therefore
doesn't notice the updated procsignal flags and doesn't process the
PSB, at least not until it receives another SIGUSR1.

Signals are sent to all processes that have their procsignal pss_pid
set, which is true for every process which has called ProcSignalInit,
which for the checkpointer (like other aux processes) happens in
AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
processes) calls AuxiliaryProcessMainCommon before registering its
signal handlers, creating a small window in time where signals are
sent, but not handled.
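
The ordering problem can be reproduced in miniature within a single process. The sketch below is not PostgreSQL code: all toy_* names are hypothetical, and it assumes that a SIGUSR1 arriving before the handler is registered is simply dropped, which is modeled here with SIG_IGN. The real disposition during that window may differ.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the procsignal slot state; not PostgreSQL code. */
static volatile sig_atomic_t toy_barrier_pending = 0;   /* set by the emitter */
static volatile sig_atomic_t toy_barrier_absorbed = 0;  /* set by the receiver */

/* Receiver side: only looks at the pending flag when SIGUSR1 arrives. */
static void toy_sigusr1_handler(int signo)
{
    (void) signo;
    if (toy_barrier_pending)
    {
        toy_barrier_pending = 0;
        toy_barrier_absorbed = 1;
    }
}

/* Emitter side: set the flag first, then signal the "published" process. */
static void toy_emit_barrier(void)
{
    toy_barrier_pending = 1;
    raise(SIGUSR1);
}

/* Accessor so callers can re-check the state after a later signal. */
int toy_barrier_was_absorbed(void)
{
    return toy_barrier_absorbed;
}

/*
 * Run one startup sequence. Returns 1 if the barrier was absorbed by the
 * time startup finished, 0 if the notification was lost.
 */
int toy_run(bool handler_before_publish)
{
    void (*prev)(int);

    toy_barrier_pending = 0;
    toy_barrier_absorbed = 0;

    /* ASSUMPTION: "no handler yet" is modeled as the signal being dropped. */
    prev = signal(SIGUSR1, SIG_IGN);
    assert(prev != SIG_ERR);

    if (handler_before_publish)
        signal(SIGUSR1, toy_sigusr1_handler);

    /* From here on the process counts as published, and the emitter fires. */
    toy_emit_barrier();

    /* Late registration, as in the problematic ordering. */
    signal(SIGUSR1, toy_sigusr1_handler);

    return toy_barrier_absorbed;
}
```

Calling toy_run(true) absorbs the barrier, while toy_run(false) leaves the pending flag set with nobody looking at it. A later raise(SIGUSR1) then unblocks things, which mirrors the "not until it receives a new SIGUSR1" behavior described above.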

# Is this new?

The issue of registering signal handlers only after opening the
process up to receiving signals has existed for a long time (unchanged
since at least 2022). Only the ProcSignalBarrier in the startup
process is new: UpdateLogicalDecodingStatusEndOfRecovery was added
with Sawada-san's 67c20979.

# A solution?

I don't have one right now.
I was thinking in the direction of having a compile-time aux process
signal handlers array per process type, which is read by
AuxiliaryProcessMainCommon() to register the signal handlers ahead of
ProcSignalInit(), but I've not yet looked at the exact implications,
nor analyzed whether that's actually safe. It would move some
duplicative code patterns into compile-time structs, but that's not
necessarily a universal good.
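
To make the idea a bit more concrete, here is a minimal standalone sketch of what a per-process-type handler table could look like. Everything in it is hypothetical (the Toy* types, the table contents, and toy_aux_main_common() as a stand-in for AuxiliaryProcessMainCommon()), and it says nothing about whether the approach is safe in the real code.

```c
#include <signal.h>
#include <stddef.h>

typedef void (*ToyHandler)(int);

/* One entry: which signal, and what to install for it. */
typedef struct
{
    int        signo;
    ToyHandler handler;
} ToySignalSpec;

/* Hypothetical subset of aux process types. */
typedef enum
{
    TOY_AUX_CHECKPOINTER,
    TOY_AUX_BGWRITER,
    TOY_AUX_NUM_TYPES
} ToyAuxType;

typedef struct
{
    const ToySignalSpec *specs;
    size_t               nspecs;
} ToySignalTable;

static volatile sig_atomic_t toy_usr1_count = 0;
static volatile sig_atomic_t toy_slot_published = 0;

static void toy_usr1(int signo)
{
    (void) signo;
    toy_usr1_count++;
}

/* Illustrative table contents only; the real handler sets differ. */
static const ToySignalSpec toy_checkpointer_signals[] = {
    {SIGUSR1, toy_usr1},
    {SIGUSR2, SIG_IGN},
};

static const ToySignalSpec toy_bgwriter_signals[] = {
    {SIGUSR1, toy_usr1},
};

#define TOY_LEN(a) (sizeof(a) / sizeof((a)[0]))

static const ToySignalTable toy_signal_tables[TOY_AUX_NUM_TYPES] = {
    [TOY_AUX_CHECKPOINTER] = {toy_checkpointer_signals, TOY_LEN(toy_checkpointer_signals)},
    [TOY_AUX_BGWRITER] = {toy_bgwriter_signals, TOY_LEN(toy_bgwriter_signals)},
};

/*
 * Stand-in for the common aux entry point: install every handler from the
 * table first, and only then "publish" the process for signaling.
 * Returns 0 on success, -1 on failure.
 */
int toy_aux_main_common(ToyAuxType type)
{
    const ToySignalTable *table;
    size_t i;

    if ((int) type < 0 || type >= TOY_AUX_NUM_TYPES)
        return -1;

    table = &toy_signal_tables[type];
    for (i = 0; i < table->nspecs; i++)
    {
        if (signal(table->specs[i].signo, table->specs[i].handler) == SIG_ERR)
            return -1;
    }

    /* Stand-in for ProcSignalInit(): from here on, others may signal us. */
    toy_slot_published = 1;
    return 0;
}
```

With this ordering there is no point at which the process is visible to signal senders but lacks its handlers. The cost is that the handler choice has to be expressible as static data per process type.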

Kind regards,

Matthias van de Meent

[0] https://discord.com/channels/1258108670710124574/1346208113132568646/1496179622591598592
[1] https://api.cirrus-ci.com/v1/artifact/task/6239099197063168/log/contrib/auto_explain/log/postmaster.log
