Re: asynchronous and vectorized execution

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: asynchronous and vectorized execution
Date: 2016-09-13 14:50:30
Message-ID: CA+TgmoacgPkQv9qE1EYpXFwt+JMMxbwqM9_Vax_8N3anBMbRbg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Aug 29, 2016 at 4:08 AM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> [ new patches ]

+ /*
+ * We assume that few nodes are async-aware and async-unaware
+ * nodes cannot be revserse-dispatched from lower nodes that is
+ * async-aware. Firing of an async node that is not a descendant
+ * of the planstate will cause such reverse-diaptching to
+ * async-aware nodes, which is unexpected behavior for them.
+ *
+ * For instance, consider an async-unaware Hashjoin(OUTER, INNER)
+ * where the OUTER is running asynchronously but the Hashjoin is
+ * waiting on the async INNER during inner-hash creation. If the
+ * OUTER fires for the case, since anyone is waiting on it,
+ * ExecAsyncWaitForNode finally dispatches to the Hashjoin which
+ * is now in the middle of thing its work.
+ */
+ if (!IsParent(planstate, node))
+ continue;

I'm not entirely sure that I understand this comment, but I don't
think it's going in the right direction. Let's start with the example
in the second paragraph. If the hash join is async-unaware, then it
isn't possible for the hash join to be both running the outer side of
the join asynchronously and at the same time waiting on the inner
side. Once it tries to pull the first tuple from the outer side, it's
waiting for that to finish and can't do anything else. So, the inner
side can't possibly get touched in any way until the outer side
finishes. For anything else to happen, the hash join would have to be
async-aware. Even if we did that, I don't think it would be right to
kick off both sides of the hash join at the same time. Right now, if
the outer side turns out to be empty, we never need to build the hash
table, and that's good.

I don't think it's a good idea to wait for only nodes that are in the
current subtree. For example, consider a plan like this:

Append
-> Foreign Scan on a
-> Hash Join
-> Foreign Scan on b
-> Hash
-> Seq Scan on x

Suppose Append and Foreign Scan are parallel-aware but the other nodes
are not. Append kicks off the Foreign Scan on a and then waits for
the hash join to produce a tuple; the hash join kicks off the Foreign
Scan on b and waits for it to return a tuple. If, while we're waiting
for the foreign scan on b, the foreign scan on a needs some attention
- either to produce tuples, or maybe just to call PQconsumeInput() so
that more data can be sent from the other side, I think we need to be
able to do that. There's no real problem here; even if the Append
becomes result-ready before the hash join returns, that is fine. We
will not actually be able to return from the append until the hash
join returns because of what's on the call stack, but that doesn't
mean that the Append can't be marked result-ready sooner than that.
The situation can be improved by making the hash join node
parallel-aware, but if we don't do that it's still not broken.

I think the reason that you originally got backed into this design was
because of problems with reentrancy. I don't think I really
understand in any detail exactly what problem you hit there, but it
seems to me that the following problem could occur:
ExecAsyncWaitForNode finds two events and schedules two callbacks. It
calls the first of those two callbacks. Before that callback returns,
it again calls ExecAsyncWaitForNode. But the new invocation of
ExecAsyncWaitForNode doesn't know that there is a second callback
pending, so it somehow gets confused. However, I think this problem
can fixed using a different method. The occurred_event and callbacks
arrays defined by ExecAsyncWaitForNode can be made part of the EState
rather than being local variables. When ExecAsyncWaitForNode is
called, it checks whether there are any pending callbacks; if so, it
removes and calls the first one. Only if there are no pending
callbacks does it actually wait; when a wait event fires, one or more
new callbacks are generated. This is very similar to the reason why
ReceiveSharedInvalidMessages uses a static messages array rather than
a normal local variable. That function is solving a problem which I
suspect is very similar to the one we have here. However, it would be
helpful if you could provide some more details on what you think the
reentrancy problems are, because I'm not able to figure them out from
your messages so far.

The most mysterious part of this hunk to me is the comment that
"Firing of an async node that is not a descendant of the planstate
will cause such reverse-diaptching to async-aware nodes, which is
unexpected behavior for them." It is the async-unaware nodes which
might have a problem. The nodes that have been taught about the new
system should know what to expect.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-09-13 15:20:00 Re: asynchronous and vectorized execution
Previous Message Amit Kapila 2016-09-13 14:34:15 Re: _hash_alloc_buckets() safety