| From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Subject: | Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process |
| Date: | 2026-04-24 17:52:31 |
| Message-ID: | CAD21AoBj+zKvgw_Q8gjr4YbKccW_uMe3OFQ5+KT246FHUuNXSQ@mail.gmail.com |
| Lists: | pgsql-hackers |
On Wed, Apr 22, 2026 at 12:05 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2026-04-22 13:21:02 +0200, Matthias van de Meent wrote:
> > If the PSB is emitted (and signaled to checkpointer) before the
> > checkpointer has registered its SIGUSR1 handler, then the checkpointer
> > won't receive the notice to check its procsignal slots, it won't
> > notice the updated procsignal flags, and it won't process the PSB; not
> > until it receives a new SIGUSR1.
> >
> > Signals are sent to all processes that have their procsignal pss_pid
> > set, which is true for every process which has called ProcSignalInit,
> > which for the checkpointer (like other aux processes) happens in
> > AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
> > processes) calls AuxiliaryProcessMainCommon before registering its
> > signal handlers, creating a small window in time where signals are
> > sent, but not handled.
>
> Hm. Have we confirmed this happens?
>
> CheckpointerMain() is called with all signals masked, so it should be ok for
> the signal handler to only be set up after AuxiliaryProcessMainCommon(), as
> long as it happens before
>
> /*
> * Unblock signals (they were blocked when the postmaster forked us)
> */
> sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
>
> as the signal delivery should be held until after unblocking signals.
Right. The postmaster blocks all signals before starting a child
process, as the following comment explains:
/*
* We start postmaster children with signals blocked. This allows them to
* install their own handlers before unblocking, to avoid races where they
* might run the postmaster's handler and miss an important control
* signal. With more analysis this could potentially be relaxed.
*/
sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
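To illustrate why installing the handler late is fine as long as it happens before unblocking, here is a minimal standalone sketch (not PostgreSQL code). It blocks SIGUSR1, raises it while no handler is installed, installs a handler, and only then unblocks. The function name pending_signal_survives_late_handler() and the handler name are hypothetical and exist only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t got_sigusr1 = 0;

static void
handle_sigusr1(int signo)
{
	(void) signo;
	got_sigusr1 = 1;
}

/*
 * Returns 1 if a SIGUSR1 raised while the signal was blocked, and before
 * any handler was installed, reaches the handler once the signal is
 * unblocked. Returns 0 otherwise.
 */
int
pending_signal_survives_late_handler(void)
{
	sigset_t	block;
	struct sigaction sa;

	got_sigusr1 = 0;

	/* Block SIGUSR1, as the parent would do before starting the child. */
	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	if (sigprocmask(SIG_BLOCK, &block, NULL) != 0)
		return 0;

	/* The signal arrives now; it stays pending because it is blocked. */
	if (raise(SIGUSR1) != 0)
		return 0;

	/* Install the handler only after the signal was sent. */
	sa.sa_handler = handle_sigusr1;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return 0;

	/* Unblocking delivers the pending signal to the new handler. */
	if (sigprocmask(SIG_UNBLOCK, &block, NULL) != 0)
		return 0;

	return got_sigusr1 ? 1 : 0;
}
```

This only shows the signal-mask behavior; it says nothing about whether the sender decides to send the signal in the first place, which is the part discussed next.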
While investigating the issue, I found a race condition between
procsignal initialization and emitting a signal barrier that could be
the cause. Imagine the following scenario:
1. In ProcSignalInit(), the checkpointer initializes its
slot->pss_barrierGeneration with the global generation.
2. In EmitProcSignalBarrier(), the startup process checks the
checkpointer's procsignal slot but skips sending the signal because
slot->pss_pid is still 0. This can happen even though the checkpointer
holds the spinlock on its slot during initialization, because the
first pid check is done without acquiring the spinlock.
3. The checkpointer sets slot->pss_pid to its pid and releases the
spinlock.
4. In WaitForProcSignalBarrier(), the startup process checks the
checkpointer's procsignal slot, whose pss_barrierGeneration has
already been initialized, and waits for it to be updated. However, the
checkpointer never updates its barrier generation because it never
received the signal.
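The interleaving above can be replayed with a small single-threaded model. This is only a sketch under simplifying assumptions: one slot, no real spinlock, atomics or signals, and a "signal" reduced to a boolean flag. The field names pss_pid and pss_barrierGeneration mirror the ones mentioned above, but ModelSlot and all model_* functions are hypothetical and are not the procsignal.c implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one procsignal slot. */
typedef struct ModelSlot
{
	int			pss_pid;				/* 0 means "not published yet" */
	uint64_t	pss_barrierGeneration;	/* last generation absorbed */
	bool		signal_pending;			/* stands in for SIGUSR1 */
} ModelSlot;

static uint64_t global_generation;

/* Step 1: the new process copies the current global generation. */
static void
model_init_generation(ModelSlot *slot)
{
	slot->pss_barrierGeneration = global_generation;
}

/* Step 3: the new process publishes its pid. */
static void
model_publish_pid(ModelSlot *slot, int pid)
{
	slot->pss_pid = pid;
}

/* Step 2: the emitter bumps the generation and signals only if pid != 0. */
static uint64_t
model_emit_barrier(ModelSlot *slot)
{
	uint64_t	target = ++global_generation;

	if (slot->pss_pid != 0)
		slot->signal_pending = true;
	return target;
}

/* The signaled process absorbs the barrier only if it was signaled. */
static void
model_process_signals(ModelSlot *slot)
{
	if (slot->signal_pending)
	{
		slot->signal_pending = false;
		slot->pss_barrierGeneration = global_generation;
	}
}

/* Step 4: the waiter blocks on any published slot that is behind. */
static bool
model_wait_would_hang(const ModelSlot *slot, uint64_t target)
{
	return slot->pss_pid != 0 && slot->pss_barrierGeneration < target;
}

/* Returns true if the waiter would wait forever in this model. */
bool
model_run(bool emit_before_pid_published)
{
	ModelSlot	slot = {0, 0, false};
	uint64_t	target;

	global_generation = 5;
	model_init_generation(&slot);				/* step 1 */
	if (emit_before_pid_published)
	{
		target = model_emit_barrier(&slot);		/* step 2: skipped signal */
		model_publish_pid(&slot, 1234);			/* step 3 */
	}
	else
	{
		model_publish_pid(&slot, 1234);
		target = model_emit_barrier(&slot);		/* signal is sent */
	}
	model_process_signals(&slot);
	return model_wait_would_hang(&slot, target);	/* step 4 */
}
```

In this model, model_run(true) reproduces the stuck wait from the scenario, while model_run(false), where the pid is published before the barrier is emitted, completes normally.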
A similar issue I found is that child processes could miss the
PROCSIGNAL_BARRIER_UPDATE_XLOG_LOGICAL_INFO signal during
initialization and end up in an inconsistent state, because
InitializeProcessXLogLogicalInfo() is called (in BaseInit()) before
ProcSignalInit(). If the startup process emits the signal while a
process is between these two steps, that process would not reflect the
latest XLogLogicalInfo state. I think we should move
InitializeProcessXLogLogicalInfo() after ProcSignalInit(), as we do
for InitLocalDataChecksumState().
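The ordering argument can be sketched with another toy model. It is an illustration only, not the attached patch: the shared state is a single boolean, "registering" stands in for ProcSignalInit(), "copying" stands in for InitializeProcessXLogLogicalInfo(), and every name below (OrderModelProc, order_model_*) is hypothetical. The update is always injected between the two initialization steps.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-process state for this model. */
typedef struct OrderModelProc
{
	bool		registered;		/* can receive update signals */
	bool		update_pending;	/* stands in for the barrier signal */
	bool		local_value;	/* process-local cached copy */
} OrderModelProc;

static bool order_model_shared_value;

/* Stand-in for copying the shared state into the local cache. */
static void
order_model_copy(OrderModelProc *proc)
{
	proc->local_value = order_model_shared_value;
}

/* Stand-in for becoming visible to signal senders. */
static void
order_model_register(OrderModelProc *proc)
{
	proc->registered = true;
}

/* Another process changes the shared state and signals registered ones. */
static void
order_model_update(OrderModelProc *proc, bool new_value)
{
	order_model_shared_value = new_value;
	if (proc->registered)
		proc->update_pending = true;
}

/* The process refreshes its cache only if it was signaled. */
static void
order_model_process_signals(OrderModelProc *proc)
{
	if (proc->update_pending)
	{
		proc->update_pending = false;
		proc->local_value = order_model_shared_value;
	}
}

/* Returns true if the local copy matches the shared state at the end. */
bool
order_model_is_consistent(bool register_first)
{
	OrderModelProc proc = {false, false, false};

	order_model_shared_value = false;
	if (register_first)
	{
		order_model_register(&proc);
		order_model_update(&proc, true);	/* lands between the steps */
		order_model_copy(&proc);
	}
	else
	{
		order_model_copy(&proc);
		order_model_update(&proc, true);	/* lands between the steps */
		order_model_register(&proc);
	}
	order_model_process_signals(&proc);
	return proc.local_value == order_model_shared_value;
}
```

With registration first, an update in the window is either seen by the copy or delivered as a signal. With the copy first, the update is silently missed, which is the inconsistency described above.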
I've attached a patch that fixes the latter problem, as the fix is
straightforward.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-Fix-race-condition-in-XLogLogicalInfo-and-ProcSignal.patch | text/x-patch | 4.3 KB |