Postmaster self-deadlock due to PLT linkage resolution

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Postmaster self-deadlock due to PLT linkage resolution
Date: 2022-08-29 19:43:55
Message-ID: 3384826.1661802235@sss.pgh.pa.us
Lists: pgsql-hackers

Buildfarm member mamba (NetBSD-current on prairiedog's former hardware)
has failed repeatedly since I set it up. I have now run the cause of
that to ground [1], and here's what's happening: if the postmaster
receives a signal just before it first waits at the select() in
ServerLoop, it can self-deadlock. During the postmaster's first use of
select(), the dynamic loader needs to resolve the PLT (procedure linkage table) entry
that the core executable uses to reach select() in libc.so, and it locks
the loader's internal data structures while doing that. If we enter
a signal handler while the lock is held, and the handler needs to do
anything that also requires the lock, the postmaster is frozen.

The probability of this happening seems remarkably small, since there's
only one narrow window per postmaster lifetime, and there just aren't
that many potential signal sources active at that time either.
Nonetheless I have traces showing it happening (1) because we receive
SIGCHLD for startup process termination and (2) because we receive
SIGUSR1 from the startup process telling us to start walreceivers.
I guess that mamba's slow single-CPU hardware interacts with the
NetBSD scheduler in just the right way to make it more probable than
you'd think. On typical modern machines, the postmaster would almost
certainly manage to wait before the startup process is able to signal
it. Still, "almost certainly" is not "certainly".

The attached patch seems to fix the problem, by forcing resolution of
the PLT link before we unblock signals. It depends on the assumption
that another select() call appearing within postmaster.c will share
the same PLT link, which seems pretty safe.

I'd originally intended to make this code "#ifdef __NetBSD__",
but on looking into the FreeBSD sources I find much the same locking
logic in their dynamic loader, and now I'm wondering if such behavior
isn't pretty standard. The added calls should have negligible cost,
so it doesn't seem unreasonable to do them everywhere.

(Of course, a much better answer is to get out of the business of
doing nontrivial stuff in signal handlers. But even if we get that
done soon, we'd surely not back-patch it.)

Thoughts?

regards, tom lane

[1] https://gnats.netbsd.org/56979

Attachment: fix-PLT-links-before-unblocking-signals.patch (text/x-diff, 1.9 KB)