Re: [HACKERS] parallel.c oblivion of worker-startup failures

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] parallel.c oblivion of worker-startup failures
Date: 2018-01-24 04:43:47
Message-ID: CAA4eK1KphxenvAxVa9NnWzgVQBqj6D99KsrPwM3A_yLgtWLa7w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 24, 2018 at 9:55 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Wed, Jan 24, 2018 at 3:45 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> In the case of Gather, for example, we won't
>>> WaitForParallelWorkersToFinish() until we've read all the tuples. If
>>> there's a tuple queue that does not yet have a sender, then we'll just
>>> wait for it to get one, even if the sender fell victim to a fork
>>> failure and is never going to show up.
>>>
>>
>> Hmm, I think that case will be addressed because tuple queues can
>> detect if the worker is not attached. It does so in the code path
>> shm_mq_receive->shm_mq_counterparty_gone. In
>> shm_mq_counterparty_gone, it can detect if the worker is gone by
>> using GetBackgroundWorkerPid. Moreover, I have manually tested this
>> particular case before saying that your patch is fine. Do you have
>> some other case in mind that I am missing?
>
> Hmm. Yeah. I can't seem to reach a stuck case and was probably just
> confused and managed to confuse Robert too. If you make
> fork_process() fail randomly (see attached), I see that there are a
> couple of easily reachable failure modes (example session at bottom of
> message):
>

In short, we are good with the committed code, right?
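
To make that detection path concrete, here is a minimal sketch of the
leader-side read (not the committed code; the variable names are
illustrative) showing how a never-started worker surfaces as an error
rather than a hang:

/* mqh: shm_mq_handle for a tuple queue whose sender never forked */
shm_mq_result res;
Size        nbytes;
void       *data;

/* Blocking receive: waits on our latch and runs CHECK_FOR_INTERRUPTS() */
res = shm_mq_receive(mqh, &nbytes, &data, false);
if (res == SHM_MQ_DETACHED)
{
    /*
     * shm_mq_counterparty_gone() concluded the sender is gone because
     * GetBackgroundWorkerPid() on the registered handle returned
     * BGWH_STOPPED.  The postmaster's SIGUSR1 after a failed fork sets
     * our latch, so the internal wait loop wakes up and re-checks.
     */
    ereport(ERROR,
            (errcode(ERRCODE_INTERNAL_ERROR),
             errmsg("lost connection to parallel worker")));
}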

> 1. HandleParallelMessages() is reached and raises a "lost connection
> to parallel worker" error because shm_mq_receive() returns
> SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked
> GetBackgroundWorkerPid() just as you said. I guess that's happening
> because some other process is (coincidentally) sending
> PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a
> process is unexpectedly stopped.
>
> 2. WaitForParallelWorkersToFinish() is reached and raises a "parallel
> worker failed to initialize" error. TupleQueueReaderNext() set done
> to true, because shm_mq_receive() returned SHM_MQ_DETACHED. Once
> again, that is because shm_mq_counterparty_gone() returned true. This
> is the bit Robert and I missed in our off-list discussion.
>
> As long as we always get our latch set by the postmaster after a fork
> failure (i.e. a SIGUSR1), GetBackgroundWorkerPid() is guaranteed to
> return BGWH_STOPPED after that, we only ever use latch/CFI loops to
> wait, and we try to read from a shm_mq, I don't see a failure mode
> that hangs.
>
> This doesn't help a parallel.c client that doesn't try to read from a
> shm_mq though, like parallel CREATE INDEX (using the proposed design
> without dynamic Barrier).
>

Yes, this is what I am trying to explain on the parallel CREATE INDEX
thread. I think there we need to use either
WaitForParallelWorkersToFinish or WaitForParallelWorkersToAttach (a
new API proposed on that thread) if we don't want to use barriers. I
see a minor disadvantage in using WaitForParallelWorkersToFinish,
which I will describe on that thread.
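
For illustration, here is a minimal sketch of how such a client could
use the proposed API (the signature is still under discussion on that
thread, and "my_worker_main" is just a placeholder):

ParallelContext *pcxt;

EnterParallelMode();
pcxt = CreateParallelContext("postgres", "my_worker_main", nworkers);
InitializeParallelDSM(pcxt);
LaunchParallelWorkers(pcxt);

/*
 * Proposed call: return once every launched worker has attached to its
 * error queue, and raise an error if any worker failed to start (e.g.
 * because fork() failed), instead of waiting forever.
 */
WaitForParallelWorkersToAttach(pcxt);

/* ... leader does its share of the parallel operation ... */

WaitForParallelWorkersToFinish(pcxt);
DestroyParallelContext(pcxt);
ExitParallelMode();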

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
