Re: parallel mode and parallel contexts

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-04-21 18:08:00
Message-ID: CA+TgmoZ0tOPvfuz5BmEUqmdQ9xYQHzkG4jjJXduYDgQJXmPFbQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> You'd need some kind of
>> API that says "pretend I'm waiting for this lock, but don't really
>> wait for it", and you'd need to be darn sure that you removed yourself
>> from the wait queue again before doing any other heavyweight lock
>> manipulation. Do you have specific thoughts on how to implement this?
>
> I've thought some about this, and I think it's a bit easier to not do it
> on the actual lock waitqueues, but teach deadlock.c about that kind of
> blocking.
>
> deadlock.c is far from simple, and at least I don't find the control
> flow to be particularly clear. So it's not easy. It'd be advantageous
> to tackle things at that level because it'd avoid the need to acquire
> locks on some lock's waitqueue when blocking; we're going to do that a
> lot.
>
> But it seems to me that it should be possible to succeed: In
> FindLockCycleRecurse(), in the case that we're not waiting for an actual
> lock (checkProc->links.next == NULL) we can add a case that considers
> the 'blocking parallelism' case. ISTM that that's just a
> FindLockCycleRecurse() call on the process that we're waiting for. We'd
> either have to find the other side's locktag for DEADLOCK_INFO or invent
> another field in there; but that seems like a solvable problem.
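
(To make that concrete, the proposed change would look very roughly like
the sketch below. The surrounding FindLockCycleRecurse() signature and the
checkProc->links.next test are the real ones from deadlock.c, but
GetParallelWaitTarget() is an invented stand-in for whatever mechanism
would advertise "the process this backend is blocked on".)

    /*
     * Hypothetical extra case in FindLockCycleRecurse(); sketch only.
     * GetParallelWaitTarget() does not exist -- it stands in for however
     * the 'blocking parallelism' wait would be advertised.
     */
    if (checkProc->links.next == NULL)
    {
        PGPROC *blocker = GetParallelWaitTarget(checkProc);

        if (blocker != NULL &&
            FindLockCycleRecurse(blocker, depth + 1, softEdges, nSoftEdges))
            return true;    /* waiting on a deadlocked parallel peer */
        return false;       /* not waiting at all, or peer not stuck */
    }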

I (finally) had some time to look at this today. Initially, it looked
good. Unfortunately, the longer I looked at it, the less convinced I
got that we could solve the problem this way. The core of the issue
is that the Funnel node in the parallel group leader basically does
this:

while (!local_scan_done || !remote_scan_done)
{
    attempt a read from each remaining worker's tuple queue,
        blocking only if local_scan_done;
    if (we got a tuple)
        return it;
    else if (there are no remaining workers)
        remote_scan_done = true;

    attempt to produce a tuple just as if we were a worker ourselves;
    if (we got a tuple)
        return it;
    else
        local_scan_done = true;
}
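
To make the "blocking only if local_scan_done" step concrete, here is a
rough sketch of the per-worker read. shm_mq_receive() is the real queue
API; the FunnelState structure and the form_tuple()/remove_worker()
helpers are invented for illustration:

    /*
     * Sketch only: poll each remaining worker's queue, in nowait mode
     * for as long as the local scan can still produce tuples.
     */
    for (i = 0; i < funnel->nworkers; i++)
    {
        Size            nbytes;
        void           *data;
        shm_mq_result   res;

        res = shm_mq_receive(funnel->queue[i], &nbytes, &data,
                             !local_scan_done);  /* nowait until local done */
        if (res == SHM_MQ_SUCCESS)
            return form_tuple(data, nbytes);     /* hypothetical helper */
        else if (res == SHM_MQ_DETACHED)
            remove_worker(funnel, i--);          /* worker is gone */
        /* else SHM_MQ_WOULD_BLOCK: move on to the next worker */
    }

(A real implementation would more likely always pass nowait = true and
sleep on the process latch between passes, rather than blocking inside
any one queue.)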

Imagine that we're doing a parallel sequential scan; each worker
claims one page but goes into the tank before it has returned all of
the tuples on that page. The master reads all the remaining pages but
must now wait for the workers to finish returning the tuples on the
pages they claimed.

So what this means is:

1. The master doesn't actually *wait* until the very end of the parallel phase.
2. When the master does wait, it waits for all of the parallel workers
at once, not each one individually.

So, I don't think anything as simplistic as teaching a blocking
shm_mq_receive() to tip off the deadlock detector that we're waiting
for the process on the other end of that particular queue can ever
work. Because of point #2, that never happens. When I first started
thinking about how to fix this, I said, well, that's no big deal, we
can just advertise the whole list of processes that we're waiting for
in shared memory, rather than just the one. This is a bit tricky,
though. Any general API for any backend to advertise that it's waiting
for an arbitrary subset of the other backends would require O(n^2)
shared memory state. That wouldn't be completely insane, but it's not
great, either. For this particular case, we could optimize that down
to O(n) by just chaining all of the children of a given parallel group
leader in a linked list whose nodes are inlined in their PGPROCs, but that
doesn't feel very general, because it forecloses the possibility of
the children ever using that API, and I think they might need to. If
nothing else, they might need to advertise that they're blocking on
the master if they are trying to send data, the queue is full, and
they have to wait for the master to drain some of it before they can
proceed.
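
For what it's worth, the O(n) version would look something like the
following; every field name here is invented, and nothing like this
exists in PGPROC today:

    /*
     * Hypothetical additions to PGPROC: chain each parallel worker into
     * a list hung off its group leader, so that "the leader waits on all
     * of its workers" takes O(n) shared-memory state rather than O(n^2).
     */
    struct PGPROC
    {
        /* ... existing fields ... */
        PGPROC     *parallelLeader;     /* our group leader, or NULL */
        SHM_QUEUE   parallelChildren;   /* leader: list of our workers */
        SHM_QUEUE   parallelLink;       /* worker: link in leader's list */
    };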

After thinking about it a bit more, I realized that even if we settle
on some solution to that problem, there's another issue: the
wait-edges created by this system don't really have the same semantics
as regular lock waits. Suppose A is waiting on a lock held by B and B
is waiting for a lock held by A; that's a deadlock. But if A is
waiting for B to write to a tuple queue and B is waiting for A to read
from a tuple queue, that's not a deadlock if the queues in question
are the same. If they are different queues, it might be a deadlock,
but then again maybe not. It may be that A is prepared to accept B's
message from one queue, and that upon fully receiving it, it will do
some further work that will lead it to write a tuple into the other
queue. If so, we're OK; if not, it's a deadlock. You may want to argue
that that's an implausible scenario, but I'm not sure it is. The worker
could be saying "hey, I need some
additional piece of your backend-local state in order to finish this
computation", and the master could then provide it. I don't have any
plans like that, but it's been suggested previously by others, so it's
not an obviously nonsensical thing to want to do.
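
One way to picture that difference is as direction labels on the
wait-graph edges; the enum below is purely illustrative, not proposed
code:

    /*
     * Illustration only.  Lock waits are a single edge kind, but
     * tuple-queue waits come in two directed flavors.  A READ edge and
     * a WRITE edge on the *same* queue is progress, not deadlock: the
     * blocked writer's data is exactly what the reader is waiting for.
     * On two different queues, it may or may not be a deadlock.
     */
    typedef enum WaitEdgeKind
    {
        WAIT_EDGE_LOCK,         /* ordinary heavyweight-lock wait */
        WAIT_EDGE_QUEUE_READ,   /* waiting for the peer to write queue Q */
        WAIT_EDGE_QUEUE_WRITE   /* waiting for the peer to read queue Q */
    } WaitEdgeKind;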

A further difference in the semantics of these wait-edges is that if
process A is awaiting AccessExclusiveLock on a resource held by B, C,
and D at AccessShareLock, it needs to wait for all of those processes
to release their locks before it can do anything else. On the other
hand, if process A is awaiting tuples from B, C, and D, it just needs
ONE of those processes to emit tuples in order to make progress. Now
maybe that doesn't make any difference in practice, because even if
two of those processes are making lively progress and A is receiving
tuples from them and processing them like gangbusters, that's probably
not going to help the third one get unstuck. If we adopt the approach
of discounting that possibility, then as long as the parallel leader
can generate tuples locally (that is, local_scan_done = false) we
don't report the deadlock, but as soon as it can no longer do that
(local_scan_done = true) then we do, even though we could still
theoretically read more tuples from the non-stuck workers. So then
you have to wonder why we're not solving problem #1, because the
deadlock was just as certain before we generated the maximum possible
number of tuples locally as it was afterwards.
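
Put in deadlock-detector terms, lock waits are AND-waits while
tuple-queue waits are OR-waits, and the cycle test would have to differ
accordingly. A sketch, with all names other than FindLockCycleRecurse()
invented:

    /*
     * Sketch only: an OR-wait is deadlocked just in case *every* process
     * we could still receive from is itself stuck.  NumWaitedForWorkers()
     * and WaitedForWorker() are hypothetical accessors.
     */
    static bool
    ParallelGroupDeadlocked(PGPROC *leader, int depth,
                            EDGE *softEdges, int *nSoftEdges)
    {
        int     i;

        for (i = 0; i < NumWaitedForWorkers(leader); i++)
        {
            PGPROC *worker = WaitedForWorker(leader, i);

            if (!FindLockCycleRecurse(worker, depth + 1,
                                      softEdges, nSoftEdges))
                return false;   /* some worker can still make progress */
        }
        return true;            /* every waited-for worker is stuck */
    }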

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
