Re: [HACKERS] parallel.c oblivion of worker-startup failures

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] parallel.c oblivion of worker-startup failures
Date: 2017-12-21 11:46:11
Message-ID: CAA4eK1+iw=4W1hBkKqTqvoOo4h_4srz+w1qE+t9cUQJTt5kBDQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Dec 21, 2017 at 2:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Dec 19, 2017 at 11:28 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> That is not the main point I am bothered about. I am concerned that
>> the patch proposed by you can lead to hang if there is a crash (abrupt
>> failure like proc_exit(1)) after attaching to the error queue. This is
>> explained in my email up thread [1]. I think we can fix it by trying
>> to read the queues as mentioned by you, but was not sure if that is
>> the best way to deal with it, so proposed an alternate solution.
>
> OK. My core point is that I really, really don't want Gather or
> Gather Merge or parallel CREATE INDEX to have to worry about these
> cases; I believe strongly that it is best if these cases cause an
> error at the ParallelContext layer. I read your email as suggesting
> the opposite, but maybe I misunderstood you.
>
> One thing I thought about is introducing a new state
> BGWH_STARTUP_FAILED, as distinguished from BGWH_STOPPED. However, it
> seems problematic. Once the worker is unregistered we wouldn't know
> which one to return, especially if the slot has been reused.
>

What if we don't allow to reuse such slots till the backend/session
that has registered it performs unregister? Currently, we don't seem
to have an API corresponding to Register*BackgroundWorker() which can
be used to unregister, but maybe we can provide such an API.

> Furthermore, it doesn't help in the case where the worker starts and
> immediately exits without attaching to the DSM.
>

Yeah, but can't we detect that case? After the worker exits, we can
know its exit status as is passed to CleanupBackgroundWorker, we can
use that to mark the worker state as BGWH_ERROR_STOPPED (or something
like BGWH_IMMEDIATE_STOPPED).

I think above way sounds invasive, but it seems to me that it can be
used by other users of background workers as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2017-12-21 12:04:48 Re: [HACKERS] path toward faster partition pruning
Previous Message Thomas Munro 2017-12-21 11:46:09 Re: condition variable cleanup and subtransactions