From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Postmaster self-deadlock due to PLT linkage resolution |
Date: | 2022-08-29 19:43:55 |
Message-ID: | 3384826.1661802235@sss.pgh.pa.us |
Lists: | pgsql-hackers |

Buildfarm member mamba (NetBSD-current on prairiedog's former hardware)
has failed repeatedly since I set it up. I have now run the cause of
that to ground [1], and here's what's happening: if the postmaster
receives a signal just before it first waits at the select() in
ServerLoop, it can self-deadlock. During the postmaster's first use of
select(), the dynamic loader needs to resolve the PLT branch table entry
that the core executable uses to reach select() in libc.so, and it locks
the loader's internal data structures while doing that. If we enter
a signal handler while the lock is held, and the handler needs to do
anything that also requires the lock, the postmaster is frozen.
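
The failure mode described above can be modeled in a few lines. This is only a sketch under stated assumptions, not code from the postmaster or from any dynamic loader: `loader_lock_held` is a hypothetical stand-in for the loader's internal lock, and the handler merely records that it arrived while the lock was held, where a real handler would block forever trying to acquire it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for the dynamic loader's internal lock (1 = held). */
static volatile sig_atomic_t loader_lock_held = 0;

/* Set by the handler if it ran while the "lock" was held. */
static volatile sig_atomic_t would_have_deadlocked = 0;

/*
 * Models a signal handler that needs the loader, for example because it
 * calls a libc function whose PLT entry is not yet resolved.  A real
 * handler would wait for the lock here and never return; this one only
 * records the collision.
 */
static void
handler_needing_loader(int signo)
{
	(void) signo;
	if (loader_lock_held)
		would_have_deadlocked = 1;
}

/*
 * Models the first call through a lazily bound PLT entry being interrupted
 * by a signal.  Returns 1 if the handler found the lock held, 0 if not,
 * and -1 if the signal setup failed.
 */
static int
simulate_interrupted_resolution(void)
{
	struct sigaction sa;
	sigset_t	unblock;

	sa.sa_handler = handler_needing_loader;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return -1;

	/* Make sure SIGUSR1 is deliverable so raise() runs the handler now. */
	sigemptyset(&unblock);
	sigaddset(&unblock, SIGUSR1);
	if (sigprocmask(SIG_UNBLOCK, &unblock, NULL) != 0)
		return -1;

	would_have_deadlocked = 0;
	loader_lock_held = 1;		/* loader takes its lock to resolve the entry */
	raise(SIGUSR1);				/* signal arrives inside the narrow window */
	loader_lock_held = 0;		/* loader would release the lock here */

	return would_have_deadlocked;
}
```

In this model the function returns 1 because the handler always runs inside the window. In the real postmaster the window is hit only when a signal happens to arrive during that first select() call.
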

The probability of this happening seems remarkably small, since there's
only one narrow window per postmaster lifetime, and there's just not
that many potential signal causes active at that time either.
Nonetheless I have traces showing it happening (1) because we receive
SIGCHLD for startup process termination and (2) because we receive
SIGUSR1 from the startup process telling us to start walreceivers.
I guess that mamba's slow single-CPU hardware interacts with the
NetBSD scheduler in just the right way to make it more probable than
you'd think. On typical modern machines, the postmaster would almost
certainly manage to wait before the startup process is able to signal
it. Still, "almost certainly" is not "certainly".

The attached patch seems to fix the problem, by forcing resolution of
the PLT link before we unblock signals. It depends on the assumption
that another select() call appearing within postmaster.c will share
the same PLT link, which seems pretty safe.
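
The ordering described above might look roughly like the following. This is a minimal sketch, not the attached patch: the function names are hypothetical, and it assumes that a select() call with no descriptors and a zero timeout is enough to trigger lazy binding, and that every select() call in the same executable goes through the same PLT entry.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Hypothetical helper: call select() once with no descriptors and a zero
 * timeout, so it returns immediately.  Its only purpose is to make the
 * dynamic loader resolve the executable's PLT entry for select() now.
 * Returns select()'s result, which is 0 when the call succeeds.
 */
static int
force_select_plt_resolution(void)
{
	struct timeval timeout;

	timeout.tv_sec = 0;
	timeout.tv_usec = 0;
	return select(0, NULL, NULL, NULL, &timeout);
}

/*
 * Sketch of the startup ordering: block all signals, resolve the PLT
 * entry, and only then restore the previous signal mask.
 * Returns 0 on success, -1 on failure.
 */
static int
startup_with_resolved_select(void)
{
	sigset_t	block_all;
	sigset_t	saved_mask;

	if (sigfillset(&block_all) != 0)
		return -1;
	if (sigprocmask(SIG_BLOCK, &block_all, &saved_mask) != 0)
		return -1;

	/* No handler can run here, so the loader's lock cannot be re-entered. */
	if (force_select_plt_resolution() != 0)
	{
		(void) sigprocmask(SIG_SETMASK, &saved_mask, NULL);
		return -1;
	}

	/* Handlers may run from here on; select() no longer needs the loader. */
	return sigprocmask(SIG_SETMASK, &saved_mask, NULL);
}
```

If the executable is linked so that symbols are bound eagerly at startup, the extra call does nothing useful but costs only one trivial system call.
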

I'd originally intended to make this code "#ifdef __NetBSD__",
but on looking into the FreeBSD sources I find much the same locking
logic in their dynamic loader, and now I'm wondering if such behavior
isn't pretty standard. The added calls should have negligible cost,
so it doesn't seem unreasonable to do them everywhere.

(Of course, a much better answer is to get out of the business of
doing nontrivial stuff in signal handlers. But even if we get that
done soon, we'd surely not back-patch it.)

Thoughts?

regards, tom lane

Attachment | Content-Type | Size |
---|---|---|
fix-PLT-links-before-unblocking-signals.patch | text/x-diff | 1.9 KB |