| From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
|---|---|
| To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
| Cc: | Alexander Lakhin <exclusion(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Subject: | Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process |
| Date: | 2026-04-29 10:49:24 |
| Message-ID: | CAEze2WhhTnSLpjGJWGupbxkTp_JdNP6v0mNgpqhi_YkXJa=m6A@mail.gmail.com |
| Lists: | pgsql-hackers |
On Wed, 22 Apr 2026 at 21:05, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2026-04-22 13:21:02 +0200, Matthias van de Meent wrote:
> > If the PSB is emitted (and signaled to checkpointer) before the
> > checkpointer has registered its SIGUSR1 handler, then the checkpointer
> > won't receive the notice to check its procsignal slots, it won't
> > notice the updated procsignal flags, and it won't process the PSB; not
> > until it receives a new SIGUSR1.
> >
> > Signals are sent to all processes that have their procsignal pss_pid
> > set, which is true for every process which has called ProcSignalInit,
> > which for the checkpointer (like other aux processes) happens in
> > AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
> > processes) calls AuxiliaryProcessMainCommon before registering its
> > signal handlers, creating a small window in time where signals are
> > sent, but not handled.
>
> Hm. Have we confirmed this happens?
>
> CheckpointerMain() is called with all signals masked, so it should be ok for
> the signal handler to only be set up after AuxiliaryProcessMainCommon(), as
> long as it happens before [...]
Yeah, that was a misidentification of the exact race that caused the issue.
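
For reference, the masking behaviour Andres describes can be reproduced outside PostgreSQL with plain POSIX calls. The following is only a minimal standalone sketch, not PostgreSQL code: `demo_blocked_signal_delivery` and `handle_sigusr1` are hypothetical names made up for this illustration. It blocks SIGUSR1, raises it while no handler is installed yet, installs the handler, and only then unblocks. The signal stays pending in the meantime and is delivered to the late-installed handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t got_sigusr1 = 0;

/* Hypothetical handler for this sketch only. */
static void
handle_sigusr1(int signo)
{
    (void) signo;
    got_sigusr1 = 1;
}

/*
 * Returns 1 if a SIGUSR1 that arrived while the signal was blocked (and
 * before any handler was installed) reaches the handler after unblocking,
 * 0 if it did not, and -1 on any unexpected failure.
 */
int
demo_blocked_signal_delivery(void)
{
    sigset_t    block_set;
    struct sigaction sa;

    got_sigusr1 = 0;

    /* Start with SIGUSR1 blocked, as a postmaster child would be. */
    sigemptyset(&block_set);
    sigaddset(&block_set, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block_set, NULL) != 0)
        return -1;

    /* The signal arrives before any handler exists; it becomes pending. */
    if (raise(SIGUSR1) != 0)
        return -1;
    if (got_sigusr1)
        return -1;              /* must not have been handled yet */

    /* Install the handler while the signal is still blocked. */
    sa.sa_handler = handle_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    /* Unblock: the pending signal is delivered to the new handler. */
    if (sigprocmask(SIG_UNBLOCK, &block_set, NULL) != 0)
        return -1;

    return got_sigusr1 ? 1 : 0;
}
```

So ordering the handler setup after the point where signals can be sent is harmless as long as the signal remains blocked until the handler is in place, which matches why that window was not the actual race.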
On Tue, 28 Apr 2026 at 21:28, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> >
> > Hello Sawada-san,
> >
> > 24.04.2026 20:52, Masahiko Sawada wrote:
> >
> > Right. The postmaster blocks all signals before starting child process
> > as the following comment explains:
> >
> > /*
> > * We start postmaster children with signals blocked. This allows them to
> > * install their own handlers before unblocking, to avoid races where they
> > * might run the postmaster's handler and miss an important control
> > * signal. With more analysis this could potentially be relaxed.
> > */
> > sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
> >
> > Investigating the issue, I found there is a race condition between the
> > procsignal initialization and emitting signal barrier that could be
> > the cause of this issue. Imagine the following scenario:
Ah, that'd be it indeed. Thanks!
> I've attached a patch to address the issue. I haven't verified it
> across all versions yet, but I suspect it exists in the stable
> branches as well. Previously, the issue rarely occurred because
> EmitProcSignalBarrier() was only used for smgr invalidation. However,
> now that we use signal barriers for online wal_level changes and
> checksum status updates, this race condition is likely to be
> encountered more frequently.
Yes, I think the boot process with the xlog_logical_info barrier is
more likely to hit this issue, as indicated by two known cases
detected in various CI jobs. It could also be that a lockup on the
new barrier is just exceptionally bad for system stability.
As for the patches:
v1-0001 -- LGTM.
0001 (upthread): LGTM, but I'd also suggest adding some code to make
sure that we're actually receiving procsignals by the time we
initialize the Logical/Checksum subsystems, which need to process
shared state changes by responding to procsignals; see attached.
smgr's procsignal doesn't really depend on shared memory state, so
I've kept that out of my patch.
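
To illustrate the shape of such a check (this is not the attached patch, and none of these names exist in PostgreSQL), here is a toy sketch. It assumes a hypothetical per-process flag that is set once procsignal registration is done, and a hypothetical subsystem init function that asserts the flag before it starts depending on barrier-driven state changes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for "this process has registered for procsignals". */
static bool toy_procsignal_ready = false;

/* Hypothetical stand-in for "the barrier-dependent subsystem is running". */
static bool toy_subsystem_started = false;

/* Toy equivalent of procsignal registration: just records that it happened. */
void
toy_procsignal_init(void)
{
    toy_procsignal_ready = true;
}

bool
toy_procsignal_is_ready(void)
{
    return toy_procsignal_ready;
}

/*
 * Toy init for a subsystem that relies on barriers to learn about shared
 * state changes. The assertion catches an init-order mistake in assert-enabled
 * builds instead of letting the process silently miss a barrier later.
 */
void
toy_barrier_dependent_init(void)
{
    assert(toy_procsignal_is_ready());
    toy_subsystem_started = true;
}

bool
toy_subsystem_is_started(void)
{
    return toy_subsystem_started;
}
```

Calling `toy_procsignal_init()` before `toy_barrier_dependent_init()` passes the assertion; swapping the order trips it in an assert-enabled build, which is the kind of early failure the suggestion is after.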
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0001-Assert-ProcSignal-is-initialized-before-its-depen.patch | application/octet-stream | 2.4 KB |