Re: Unportable implementation of background worker start

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rémi Zara <remi_zara(at)mac(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, cm(at)enterprisedb(dot)com, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unportable implementation of background worker start
Date: 2017-04-26 15:42:38
Message-ID: 4707.1493221358@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Rémi Zara <remi_zara(at)mac(dot)com> writes:
>> coypu was not stuck (no buildfarm related process running), but failed to clean-up shared memory and semaphores.
>> I’ve done the clean-up.

> Huh, that's even more interesting.

I installed NetBSD 5.1.5 on an old Mac G4; I believe this is a reasonable
approximation to coypu's environment. With the pselect patch installed,
I can replicate the behavior we saw in the buildfarm of connections
immediately failing with "the database system is starting up".
Investigation shows that pselect reports ready sockets correctly (which is
what allows connections to get in at all), and it does stop waiting either
for a signal or for a timeout. What it forgets to do is to actually
service the signal. The observed behavior is caused by the fact that
reaper() is never called so the postmaster never realizes that the startup
process has finished.

I experimented with putting

PG_SETMASK(&UnBlockSig);
PG_SETMASK(&BlockSig);

immediately after the pselect() call, and found that indeed that lets
signals get serviced, and things work pretty much normally.
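As a rough illustration of that workaround outside the postmaster, the sketch below waits in pselect() with a signal blocked, then briefly opens and re-closes the signal mask so that any handler pselect() failed to run gets its chance. Everything named sketch_* is hypothetical: the two sigset_t variables merely stand in for BlockSig and UnBlockSig, plain sigprocmask() stands in for PG_SETMASK, and SIGUSR1 stands in for the signals the postmaster actually cares about.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>

/* Set by the handler; a stand-in for the work reaper() would do. */
static volatile sig_atomic_t sketch_got_signal = 0;

static sigset_t sketch_block_sig;    /* hypothetical stand-in for BlockSig */
static sigset_t sketch_unblock_sig;  /* hypothetical stand-in for UnBlockSig */

static void
sketch_handler(int signo)
{
    (void) signo;
    sketch_got_signal = 1;
}

/* Install the handler and block SIGUSR1, as the main loop would at startup. */
static void
sketch_setup(void)
{
    struct sigaction sa;

    sa.sa_handler = sketch_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGUSR1, &sa, NULL);

    sigemptyset(&sketch_unblock_sig);
    sigemptyset(&sketch_block_sig);
    sigaddset(&sketch_block_sig, SIGUSR1);
    sigprocmask(SIG_SETMASK, &sketch_block_sig, NULL);
}

/* One wait: pselect() with signals unblocked, then the explicit workaround. */
static int
sketch_wait_once(long timeout_ms)
{
    struct timespec ts;
    int rc;

    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    /* No file descriptors here; we only care about signals and the timeout. */
    rc = pselect(0, NULL, NULL, NULL, &ts, &sketch_unblock_sig);

    /*
     * Workaround: open the mask and close it again. Any signal still
     * pending is delivered before the first sigprocmask() returns.
     */
    sigprocmask(SIG_SETMASK, &sketch_unblock_sig, NULL);
    sigprocmask(SIG_SETMASK, &sketch_block_sig, NULL);

    return rc;
}
```

On a platform whose pselect() already services signals correctly, the extra unblock/block pair should simply find nothing pending, so the sketch is meant to be harmless there.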

However, closer inspection finds that pselect only stops waiting when
a signal arrives *while it's waiting*, not if there was a signal already
pending. So this is actually even more broken than the so-called
"non-atomic" behavior we had expected to see --- at least with that, the
pending signal would have gotten serviced promptly, even if ServerLoop
itself didn't iterate.
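The already-pending case can be probed in isolation. The following is a minimal sketch, not the test that was actually run on the NetBSD machine: it blocks SIGUSR1, raises it so that it is pending, and then calls pselect() with a mask that unblocks it. The function name, the return convention, and the choice of SIGUSR1 and a two-second timeout are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>

static volatile sig_atomic_t probe_handler_ran = 0;

static void
probe_handler(int signo)
{
    (void) signo;
    probe_handler_ran = 1;
}

/*
 * Returns 1 if pselect() was interrupted by the already-pending signal and
 * the handler ran, 0 if it did not behave that way (for example, it slept
 * through the timeout), and -1 if the probe could not be set up.
 */
static int
probe_pselect_pending_signal(void)
{
    struct sigaction sa, old_sa;
    sigset_t blocked, old_mask, wait_mask;
    struct timespec timeout = {2, 0};
    int rc, saved_errno, ok;

    sa.sa_handler = probe_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, &old_sa) != 0)
        return -1;

    sigemptyset(&blocked);
    sigaddset(&blocked, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &blocked, &old_mask) != 0)
        return -1;

    /* The signal is blocked, so this leaves it pending, not delivered. */
    probe_handler_ran = 0;
    raise(SIGUSR1);

    /* Wait with the caller's original mask, minus SIGUSR1. */
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGUSR1);

    rc = pselect(0, NULL, NULL, NULL, &timeout, &wait_mask);
    saved_errno = errno;

    ok = (rc == -1 && saved_errno == EINTR && probe_handler_ran);

    /* Restore the mask first so a still-pending signal hits our handler. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGUSR1, &old_sa, NULL);

    return ok;
}
```

A result of 0 would correspond to the behavior described above, where the pending signal is neither noticed nor serviced by pselect() itself.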

This is all giving me less than warm fuzzy feelings about the state of
pselect support out in the real world.

So at this point we seem to have three plausible alternatives:

1. Let HEAD stand as it is. We have a problem with slow response to
bgworker start requests that arrive while ServerLoop is active, but that's
a pretty tight window usually (although I believe I've seen it hit at
least once in testing).

2. Reinstall the pselect patch, blacklisting NetBSD and HPUX and whatever
else we find to be flaky. Then only the blacklisted platforms have the
problem.

3. Go ahead with converting the postmaster to use WaitEventSet, a la
the draft patch I posted earlier. I'd be happy to do this if we were
at the start of a devel cycle, but right now seems a bit late --- not
to mention that we really need to fix 9.6 as well.

We could substantially ameliorate the slow-response problem by allowing
maybe_start_bgworker to launch multiple workers per call, which is
something I think we should do regardless. (I have a patch written to
allow it to launch up to N workers per call, but have held off committing
that till after the dust settles in ServerLoop.)
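To make the "up to N workers per call" idea concrete, here is a simplified sketch. It is not the patch mentioned above and does not use the postmaster's real data structures: SketchWorker, sketch_start_workers, and the limit of 3 are all hypothetical, and "launching" is reduced to flipping a flag. The point is only the shape of the loop: start several pending workers in one pass, and report whether requests remain so the caller knows to come back soon.

```c
#include <stddef.h>

/* Hypothetical cap on launches per call (the "N" in the text). */
#define SKETCH_MAX_LAUNCHES_PER_CALL 3

/* Hypothetical, much-simplified stand-in for a registered worker slot. */
typedef struct SketchWorker
{
    int start_requested;    /* nonzero if someone asked for this worker */
    int running;            /* nonzero once it has been "launched" */
} SketchWorker;

/*
 * Launch up to SKETCH_MAX_LAUNCHES_PER_CALL requested workers.
 * Returns the number launched; sets *more_pending to 1 if at least one
 * request was left unserved because the cap was reached, else 0.
 */
static int
sketch_start_workers(SketchWorker *workers, size_t nworkers, int *more_pending)
{
    int launched = 0;
    size_t i;

    *more_pending = 0;

    for (i = 0; i < nworkers; i++)
    {
        if (!workers[i].start_requested || workers[i].running)
            continue;

        if (launched >= SKETCH_MAX_LAUNCHES_PER_CALL)
        {
            /* Cap reached: leave the rest for the next call. */
            *more_pending = 1;
            break;
        }

        /* A real implementation would fork the worker process here. */
        workers[i].running = 1;
        workers[i].start_requested = 0;
        launched++;
    }

    return launched;
}
```

With five requests outstanding and a cap of three, two calls are enough to start everything, instead of the five loop iterations a one-per-call scheme would need.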

I'm leaning to doing #1 plus the maybe_start_bgworker change. There's
certainly room for difference of opinion here, though. Thoughts?

regards, tom lane
