| From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
|---|---|
| To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Subject: | Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process |
| Date: | 2026-04-27 18:00:00 |
| Message-ID: | 4358bd85-f6b4-4da6-9909-74428fe3c8f7@gmail.com |
| Lists: | pgsql-hackers |
Hello Sawada-san,
24.04.2026 20:52, Masahiko Sawada wrote:
> Right. The postmaster blocks all signals before starting child process
> as the following comment explains:
>
> /*
> * We start postmaster children with signals blocked. This allows them to
> * install their own handlers before unblocking, to avoid races where they
> * might run the postmaster's handler and miss an important control
> * signal. With more analysis this could potentially be relaxed.
> */
> sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
>
> Investigating the issue, I found there is a race condition between the
> procsignal initialization and emitting signal barrier that could be
> the cause of this issue. Imagine the following scenario:
>
> 1. In ProcSignalInit(), the checkpointer initializes its
> slot->pss_barrierGeneration with the global generation.
> 2. In EmitProcSignalBarrier(), the startup checks the checkpointer's
> procsignal slot but it skips emitting the signal as slot->pss_pid is
> still 0. It can happen even though the checkpointer holds a spinlock
> on its slot during the initialization because the first pid check is
> done without a spinlock acquisition.
> 3. The checkpointer sets its pid to slot->pss_pid and releases the spin lock.
> 4. In WaitForProcSignalBarrier(), the startup checks the
> checkpointer's procsignal slot that has already initialized the
> pss_barrierGeneration, and waits for it to be updated. However, the
> checkpointer never updates its barrier generation as it doesn't get
> the signal.
Thank you for the investigation and explanation of the issue!
I've been puzzled by a buildfarm failure [1] with such symptoms for a while
and even reproduced it locally once, but couldn't gather more information
at the time. Now that you have described the scenario, I can easily
reproduce the same test failure with:
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	if (cancel_key_len > 0)
 		memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
 	slot->pss_cancel_key_len = cancel_key_len;
+	pg_usleep(10000);
 	pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
and then just running `meson test test_oat_hooks_*/regress` with the test multiplied 30 times:
26/30 test_oat_hooks_28 - postgresql:test_oat_hooks_28/regress OK 1.28s 2 subtests passed
27/30 test_oat_hooks_30 - postgresql:test_oat_hooks_30/regress OK 1.25s 2 subtests passed
28/30 test_oat_hooks_2 - postgresql:test_oat_hooks_2/regress ERROR 62.49s exit status 2
2026-04-27 17:34:44.290 UTC postmaster[1578102] LOG: starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-16.0.1, 64-bit
2026-04-27 17:34:44.290 UTC postmaster[1578102] LOG: listening on Unix socket "/tmp/pg_regress-QdhMPt/.s.PGSQL.40086"
2026-04-27 17:34:44.302 UTC startup[1578114] LOG: database system was shut down at 2026-04-27 17:34:44 UTC
2026-04-27 17:34:44.325 UTC dead-end client backend[1578133] [unknown] FATAL: the database system is starting up
...
2026-04-27 17:34:49.274 UTC dead-end client backend[1578643] [unknown] FATAL: the database system is starting up
2026-04-27 17:34:49.308 UTC startup[1578114] LOG: still waiting for backend with PID 1578110 to accept ProcSignalBarrier
2026-04-27 17:34:49.325 UTC dead-end client backend[1578645] [unknown] FATAL: the database system is starting up
...
2026-04-27 17:35:44.332 UTC dead-end client backend[1582376] [unknown] FATAL: the database system is starting up
2026-04-27 17:35:44.351 UTC startup[1578114] LOG: still waiting for backend with PID 1578110 to accept ProcSignalBarrier
2026-04-27 17:35:44.383 UTC dead-end client backend[1582379] [unknown] FATAL: the database system is starting up
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2026-03-10%2013%3A58%3A55
Best regards,
Alexander