Re: Weird failure with latches in curculio on v15

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Fujii Masao <fujii(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Weird failure with latches in curculio on v15
Date: 2023-02-03 07:44:22
Message-ID: 20230203074422.wx2ysoraqh46zqg2@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-02-03 02:24:03 -0500, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > Ugh, I think I might understand what's happening:
>
> > The signal arrives just after the fork() (within system()). Because we
> > have all our processes configure themselves as process group leaders,
> > and we signal the entire process group (c.f. signal_child()), both the
> > child process and the parent will process the signal. So we'll end up
> > doing a proc_exit() in both. As both are trying to remove themselves
> > from the same PGPROC etc entry, that doesn't end well.
>
> Ugh ...
>
> > I don't see how we can solve that properly as long as we use system().
>
> ... but I don't see how that's system()'s fault? Doing the fork()
> ourselves wouldn't change anything about that.

If we did the fork ourselves, we'd temporarily change the signal mask
before the fork() and reset it immediately in the parent, but not in the
child. We can't do that with system(), because we don't get control back
early enough - we'd just block signals for the entire duration of
system().

I wonder if this shows a problem with the change in 14 to make pgarch.c
be attached to shared memory. Before that it didn't have to worry about
problems like the above in the archiver, but now we do. It's less severe
than the startup process issue, because we don't have a comparable
signal handler in pgarch, but still.

I'm e.g. not sure that there aren't issues with
procsignal_sigusr1_handler() or such executing in a forked process.

> > A workaround for the back branches could be to have a test in
> > StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and
> > not do the proc_exit() if they don't match. We probably should just do
> > an _exit() in that case.
>
> Might work.

I wonder if we should add code complaining loudly about such a mismatch
to proc_exit(), in addition to handling it more silently in
StartupProcShutdownHandler(). Also, an assertion in
[Auxiliary]ProcKill that proc->xid == MyProcPid == getpid() seems like a
good idea.

> > OTOH, the current approach only works on systems with setsid(2) support,
> > so we probably shouldn't rely so hard on it anyway.
>
> setsid(2) is required since SUSv2, so I'm not sure which systems
> are of concern here ... other than Redmond's of course.

I was thinking of windows, yes.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-02-03 07:46:36 Re: Weird failure with latches in curculio on v15
Previous Message Heikki Linnakangas 2023-02-03 07:42:52 Re: In-placre persistance change of a relation