parallel.c oblivion of worker-startup failures

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: parallel.c oblivion of worker-startup failures
Date: 2017-09-18 04:26:49
Message-ID: CAA4eK1KDfKkvrjxsKJi3WPyceVi3dH1VCkbTJji2fuwKuB=3uw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Sometime back Tom Lane has reported [1] about $Subject. I have looked
into the issue and found that the problem is not only with parallel
workers but with general background worker machinery as well in
situations where fork or some such failure occurs. The first problem
is that after we register the dynamic worker, the way to know whether
the worker has started
(WaitForBackgroundWorkerStartup/GetBackgroundWorkerPid) won't give the
right answer if the fork failure happens. Also, in cases where the
worker is marked not to start after the crash, postmaster doesn't
notify the backend if it is not able to start the worker which can
make the backend wait forever as it is oblivion of the fact that the
worker is not started. Now, apart from these general problems of
background worker machinery, parallel.c assumes that after it has
registered the dynamic workers, they will start and perform their
work. We need to ensure that in case, postmaster is not able to start
parallel workers due to fork failure or any similar situations,
backend doesn't keep on waiting forever. To fix it, before waiting
for workers to finish, we can check whether the worker exists at all.
Attached patch fixes these problems.

Another way to fix the parallel query related problem is that after
registering the workers, the master backend should wait for workers to
start before setting up different queues for them to communicate. I
think that can be quite expensive.

Thoughts?

[1] - https://www.postgresql.org/message-id/4905.1492813727@sss.pgh.pa.us

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
fix_worker_startup_failures_v1.patch application/octet-stream 2.4 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rafia Sabih 2017-09-18 04:48:24 Re: Partition-wise join for join between (declaratively) partitioned tables
Previous Message Thomas Munro 2017-09-18 03:07:22 Re: Automatic testing of patches in commit fest