Re: [HACKERS] parallel.c oblivion of worker-startup failures

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] parallel.c oblivion of worker-startup failures
Date: 2018-01-23 22:10:51
Message-ID: CAH2-Wzn=9WuLs_03J47qrW8PVCvkthOKQb89v4TS9XkPQLQSZg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 23, 2018 at 1:05 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I was just talking to Thomas, and one or the other of us realized that
> this doesn't completely fix the problem. I think that if a worker
> fails, even by some unorthodox mechanism like an abrupt proc_exit(1),
> after the point where it attached to the error queue, this patch is
> sufficient to make that reliably turn into an error. But what if it
> fails before that - e.g. fork() failure? In that case, this process
> will catch the problem if the calling process calls
> WaitForParallelWorkersToFinish, but does that actually happen in any
> interesting cases? Maybe not.
>
> In the case of Gather, for example, we won't
> WaitForParallelWorkersToFinish() until we've read all the tuples. If
> there's a tuple queue that does not yet have a sender, then we'll just
> wait for it to get one, even if the sender fell victim to a fork
> failure and is never going to show up.
>
> We could *almost* fix this by having execParallel.c include a Barrier
> in the DSM, similar to what I proposed for parallel CREATE INDEX.
> When a worker starts, it attaches to the barrier; when it exits, it
> detaches. Instead of looping until the leader is done and all the
> tuple queues are detached, Gather or Gather Merge could loop until the
> leader is done and everyone detaches from the barrier. But that would
> require changes to the Barrier API, which isn't prepared to have the
> caller alternate waiting with other activity like reading the tuple
> queues, which would clearly be necessary in this case. Moreover, it
> doesn't work at all when parallel_leader_participation=off, because in
> that case we'll again just wait forever for some worker to show up,
> and if none of them do, then we're hosed.

I'm relieved that we all finally seem to be on the same page on this issue.
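
For concreteness, my understanding of the loop you have in mind is
something like the sketch below. (Names are made up --
"workdone_barrier" is the hypothetical Barrier placed in the DSM by
execParallel.c, and readnext_from_queues() stands in for the existing
tuple-queue reading logic -- and, as you say, the current Barrier API
won't actually let the caller interleave waiting with reading the
queues like this.)

/*
 * Sketch only: workers would BarrierAttach() during startup and
 * BarrierDetach() on exit, and Gather's wait loop would test for
 * zero remaining participants instead of detached tuple queues.
 */
static HeapTuple
gather_next_with_barrier(GatherState *gatherstate,
                         Barrier *workdone_barrier)
{
    for (;;)
    {
        HeapTuple   tup;

        /* First drain whatever the tuple queues have for us. */
        tup = readnext_from_queues(gatherstate);    /* hypothetical */
        if (tup != NULL)
            return tup;

        /*
         * Queues are empty.  Zero participants *should* mean every
         * worker has attached and then detached -- but it is also
         * what we see before any worker has attached at all (fork()
         * failure, or just parallel_leader_participation=off plus
         * startup latency), which is exactly the hole described
         * above.
         */
        if (BarrierParticipants(workdone_barrier) == 0)
            return NULL;

        /* Sleep until a worker signals us, then recheck. */
        WaitLatch(MyLatch, WL_LATCH_SET, 0, WAIT_EVENT_EXECUTE_GATHER);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
}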

> One idea to fix this is to add a new function in parallel.c with a
> name like CheckForFailedWorkers() that Gather and Gather Merge call
> just before they WaitLatch(). If a worker fails to start, the
> postmaster will send SIGUSR1 to the leader, which will set the latch,
> so we can count on the function being run after every worker death.
> The function would go through and check each worker for which
> pcxt->worker[i].error_mqh != NULL && !pcxt->any_message_received[i] to
> see whether GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) ==
> BGWH_STOPPED. If so, ERROR.

Sounds workable to me.
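
Just to make sure we're imagining the same thing, here's a sketch of
what I take the proposal to be. (any_message_received[] is the flag
array from your patch, and the error message wording is mine.)

/*
 * Sketch of the proposed check.  Gather and Gather Merge would call
 * this just before each WaitLatch().
 */
void
CheckForFailedWorkers(ParallelContext *pcxt)
{
    int         i;

    for (i = 0; i < pcxt->nworkers_launched; i++)
    {
        pid_t       pid;

        /* Already attached (or already cleaned up)?  Then it's fine. */
        if (pcxt->worker[i].error_mqh == NULL ||
            pcxt->any_message_received[i])
            continue;

        /*
         * Never heard from this worker.  If it has already stopped,
         * it must have died before attaching to its error queue --
         * e.g. fork() failure -- so no error will ever arrive
         * through that queue.
         */
        if (GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) ==
            BGWH_STOPPED)
            ereport(ERROR,
                    (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
                     errmsg("parallel worker failed to initialize")));
    }
}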

> This lets us *run* for arbitrary periods of time without detecting a
> fork failure, but before we *wait* we will notice. Getting more
> aggressive than that sounds expensive, but I'm open to ideas.

Is that really so different from any other case? I imagine that in
practice most interrupt checks happen within the leader inside a
WaitLatch() loop (or something similar). Besides, the kinds of
failures we're talking about are incredibly rare in the real world. I
think it's fine if the leader doesn't become aware of such a failure
very promptly, or if execution becomes slow in some other way.
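
That is, I'd expect the check to slot into the existing wait loops
without much fuss -- a fragment like this, using the
CheckForFailedWorkers() sketched above:

/* Sketch: a Gather-style wait loop with the proposed check added. */
for (;;)
{
    CheckForFailedWorkers(pcxt);    /* ERRORs out on startup failure */

    /* ... try to read from the tuple queues; return on success ... */

    WaitLatch(MyLatch, WL_LATCH_SET, 0, WAIT_EVENT_EXECUTE_GATHER);
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
}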

> If we adopt this approach, then Peter's patch could probably make use
> of it, too. It's a little tricky because ConditionVariableSleep()
> tries not to return spuriously, so we'd either have to change that or
> stick CheckForFailedWorkers() into that function's loop. I suspect
> the former is a better approach. Or maybe that patch should still use
> the Barrier approach and avoid all of this annoyance.

I have the same misgivings as before about using a barrier to solve
this problem for parallel CREATE INDEX. Specifically: why should
nbtsort.c solve this problem for itself? I don't think that it's in
any way special. One can imagine exactly the same problem with other
parallel DDL implementations, fixable using exactly the same solution.

I'm not saying that nbtsort.c shouldn't have to do anything
differently; just that it should buy into some general solution. That
said, it would be ideal if the current approach I take in nbtsort.c
didn't have to be changed *at all*, because if it were 100% correct
already then I think we'd have a more robust parallel.h interface (it
is certainly a "nice to have" for me). The fact that multiple hackers
at least tacitly assumed that nworkers_launched is a reliable
indicator of how many workers will actually show up tells us
something (see the sketch below).
Moreover, parallel_leader_participation=off is always going to be
wonky under a regime based on "see who shows up", because failure
becomes indistinguishable from high latency -- I'd rather avoid having
to think about nbtsort.c in distributed systems terms.
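
To make that last point concrete, here is roughly the pattern that
everyone tacitly assumed was safe (a sketch, loosely modeled on my
parallel CREATE INDEX patch; field names are illustrative, and leader
participation is ignored for brevity):

/*
 * Wait until all of the workers we "launched" report in.  If one of
 * them never started -- fork() failure -- nparticipantsdone never
 * reaches the target, and the leader sleeps here forever.
 */
for (;;)
{
    bool        done;

    SpinLockAcquire(&shared->mutex);
    done = (shared->nparticipantsdone == pcxt->nworkers_launched);
    SpinLockRelease(&shared->mutex);

    if (done)
        break;

    ConditionVariableSleep(&shared->workersdonecv,
                           WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
}
ConditionVariableCancelSleep();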

--
Peter Geoghegan
