Re: [HACKERS] parallel.c oblivion of worker-startup failures

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] parallel.c oblivion of worker-startup failures
Date: 2017-12-20 04:28:14
Message-ID: CAA4eK1JF0J5_z0sKYygRJakm9a4x9YhB=OSp5sL5U6o8rta=qw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 19, 2017 at 9:01 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Dec 19, 2017 at 5:01 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> I think it would have been much easier to fix this problem if we would
>> have some way to differentiate whether the worker has stopped
>> gracefully or not. Do you think it makes sense to introduce such a
>> state in the background worker machinery?
>
> I think it makes a lot more sense to just throw an ERROR when the
> worker doesn't shut down cleanly, which is currently what happens in
> nearly all cases. It only fails to happen for fork() failure and
> other errors that happen very early in startup. I don't think there's
> any point in trying to make this code more complicated to cater to
> such cases. If fork() is failing, the fact that parallel query is
> erroring out rather than running with fewer workers is likely to be a
> good thing. Your principle concern in that situation is probably
> whether your attempt to log into the machine and kill some processes
> is also going to die with 'fork failure', and having PostgreSQL
> consume every available process slot is not going to make that easier.
> On the other hand, if workers are failing so early in startup that
> they never attach to the error queue, then they're probably all
> failing the same way and trying to cope with that problem in any way
> other than throwing an error is going to result in parallelism being
> silently disabled with no notification to the user, which doesn't seem
> good to me either.
>

That is not the main point I am bothered about. I am concerned that
the patch proposed by you can lead to hang if there is a crash (abrupt
failure like proc_exit(1)) after attaching to the error queue. This is
explained in my email up thread [1]. I think we can fix it by trying
to read the queues as mentioned by you, but was not sure if that is
the best way to deal with it, so proposed an alternate solution.

[1] - https://www.postgresql.org/message-id/CAA4eK1LkHBbafduA8em1fb-in4kDBgmQf6qO5TV8CfyuTzEFaQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2017-12-20 04:30:47 Re: User defined data types in Logical Replication
Previous Message Amit Kapila 2017-12-20 04:17:10 Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com