Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
Date: 2026-04-27 18:00:00
Message-ID: 4358bd85-f6b4-4da6-9909-74428fe3c8f7@gmail.com
Lists: pgsql-hackers

Hello Sawada-san,

24.04.2026 20:52, Masahiko Sawada wrote:
> Right. The postmaster blocks all signals before starting child process
> as the following comment explains:
>
> /*
> * We start postmaster children with signals blocked. This allows them to
> * install their own handlers before unblocking, to avoid races where they
> * might run the postmaster's handler and miss an important control
> * signal. With more analysis this could potentially be relaxed.
> */
> sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
>
> Investigating the issue, I found there is a race condition between the
> procsignal initialization and emitting signal barrier that could be
> the cause of this issue. Imagine the following scenario:
>
> 1. In ProcSignalInit(), the checkpointer initializes its
> slot->pss_barrierGeneration with the global generation.
> 2. In EmitProcSignalBarrier(), the startup checks the checkpointer's
> procsignal slot but it skips emitting the signal as slot->pss_pid is
> still 0. It can happen even though the checkpointer holds a spinlock
> on its slot during the initialization because the first pid check is
> done without a spinlock acquisition.
> 3. The checkpointer sets its pid to slot->pss_pid and releases the spin lock.
> 4. In WaitForProcSignalBarrier(), the startup checks the
> checkpointer's procsignal slot that has already initialized the
> pss_barrierGeneration, and waits for it to be updated. However, the
> checkpointer never updates its barrier generation as it doesn't get
> the signal.
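
The four steps quoted above can be replayed as a single-threaded toy model. This is only a sketch for illustration: ToySlot and all toy_* functions are hypothetical names, not the real procsignal.c structures or API, and the model omits spinlocks, atomics, and real signals. It only shows that the order "initialize generation, emit, publish pid" leaves the waiter with nothing to wake it, whereas "initialize generation, publish pid, emit" does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for one procsignal slot (hypothetical, not the real struct). */
typedef struct ToySlot
{
    uint32_t pid;                /* 0 = slot looks unused to other processes */
    uint64_t barrier_generation; /* last generation this process has absorbed */
    bool     signalled;          /* did the emitter decide to signal this slot? */
} ToySlot;

/* Step 1: the new process copies the current global generation into its slot. */
static void
toy_init_generation(ToySlot *slot, uint64_t global_generation)
{
    slot->barrier_generation = global_generation;
}

/* Step 3: the new process publishes its pid, making the slot visible. */
static void
toy_publish_pid(ToySlot *slot, uint32_t pid)
{
    assert(pid != 0);
    slot->pid = pid;
}

/* Step 2: the emitter bumps the generation and signals only slots with a pid. */
static uint64_t
toy_emit_barrier(ToySlot *slot, uint64_t *global_generation)
{
    uint64_t generation = ++(*global_generation);

    if (slot->pid != 0)
        slot->signalled = true;
    return generation;
}

/*
 * Step 4: the waiter is stuck if the slot is visible, is behind the emitted
 * generation, and was never signalled (so nothing will make it catch up).
 */
static bool
toy_waiter_stuck(const ToySlot *slot, uint64_t generation)
{
    return slot->pid != 0 &&
           slot->barrier_generation < generation &&
           !slot->signalled;
}

/* Interleaving from the quoted scenario: init, emit, publish pid, wait. */
bool
toy_racy_interleaving(void)
{
    uint64_t global_generation = 0;
    ToySlot  slot = {0, 0, false};
    uint64_t generation;

    toy_init_generation(&slot, global_generation);
    generation = toy_emit_barrier(&slot, &global_generation);
    toy_publish_pid(&slot, 1234);
    return toy_waiter_stuck(&slot, generation);
}

/* Same steps, but the pid is published before the barrier is emitted. */
bool
toy_safe_interleaving(void)
{
    uint64_t global_generation = 0;
    ToySlot  slot = {0, 0, false};
    uint64_t generation;

    toy_init_generation(&slot, global_generation);
    toy_publish_pid(&slot, 1234);
    generation = toy_emit_barrier(&slot, &global_generation);
    return toy_waiter_stuck(&slot, generation);
}
```

In this model toy_racy_interleaving() reports a stuck waiter and toy_safe_interleaving() does not; the only difference between them is whether the pid becomes visible before or after the emitter inspects the slot.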

Thank you for the investigation and explanation of the issue!

I've been puzzled by a buildfarm failure [1] with these symptoms for a while
and even reproduced it locally once, but couldn't gather more information
at the time. Now that you have described the scenario, I can easily
reproduce the same test failure with:
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
        if (cancel_key_len > 0)
                memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
        slot->pss_cancel_key_len = cancel_key_len;
+pg_usleep(10000);
        pg_atomic_write_u32(&slot->pss_pid, MyProcPid);

just by running `meson test test_oat_hooks_*/regress` with the test multiplied x30:
26/30 test_oat_hooks_28 - postgresql:test_oat_hooks_28/regress         OK 1.28s   2 subtests passed
27/30 test_oat_hooks_30 - postgresql:test_oat_hooks_30/regress         OK 1.25s   2 subtests passed
28/30 test_oat_hooks_2 - postgresql:test_oat_hooks_2/regress           ERROR 62.49s   exit status 2

2026-04-27 17:34:44.290 UTC postmaster[1578102] LOG:  starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-16.0.1, 64-bit
2026-04-27 17:34:44.290 UTC postmaster[1578102] LOG:  listening on Unix socket "/tmp/pg_regress-QdhMPt/.s.PGSQL.40086"
2026-04-27 17:34:44.302 UTC startup[1578114] LOG:  database system was shut down at 2026-04-27 17:34:44 UTC
2026-04-27 17:34:44.325 UTC dead-end client backend[1578133] [unknown] FATAL:  the database system is starting up
...
2026-04-27 17:34:49.274 UTC dead-end client backend[1578643] [unknown] FATAL:  the database system is starting up
2026-04-27 17:34:49.308 UTC startup[1578114] LOG:  still waiting for backend with PID 1578110 to accept ProcSignalBarrier
2026-04-27 17:34:49.325 UTC dead-end client backend[1578645] [unknown] FATAL:  the database system is starting up
...
2026-04-27 17:35:44.332 UTC dead-end client backend[1582376] [unknown] FATAL:  the database system is starting up
2026-04-27 17:35:44.351 UTC startup[1578114] LOG:  still waiting for backend with PID 1578110 to accept ProcSignalBarrier
2026-04-27 17:35:44.383 UTC dead-end client backend[1582379] [unknown] FATAL:  the database system is starting up

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2026-03-10%2013%3A58%3A55

Best regards,
Alexander
