Re: asynchronous and vectorized execution

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: amitdkhan(dot)pg(at)gmail(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: asynchronous and vectorized execution
Date: 2016-09-29 08:50:09
Message-ID: 20160929.175009.181715872.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello, thank you for the comment.

At Fri, 23 Sep 2016 18:15:40 +0530, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote in <CAJ3gD9fZ=rtBZ0i1_pxycbkgxi=OzTgv1n0ojkmK318Mcc921A(at)mail(dot)gmail(dot)com>
> On 13 September 2016 at 20:20, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> > On Mon, Aug 29, 2016 at 4:08 AM, Kyotaro HORIGUCHI
> > <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > [ new patches ]
> >
> > +        /*
> > +         * We assume that few nodes are async-aware and that
> > +         * async-unaware nodes cannot be reverse-dispatched from
> > +         * lower nodes that are async-aware.  Firing an async node
> > +         * that is not a descendant of the planstate would cause such
> > +         * reverse-dispatching to async-aware nodes, which is
> > +         * unexpected behavior for them.
> > +         *
> > +         * For instance, consider an async-unaware Hashjoin(OUTER,
> > +         * INNER) where the OUTER is running asynchronously but the
> > +         * Hashjoin is waiting on the async INNER during inner-hash
> > +         * creation.  If the OUTER fires in that case, since no one
> > +         * is actually waiting on it, ExecAsyncWaitForNode finally
> > +         * dispatches to the Hashjoin, which is now in the middle of
> > +         * doing its work.
> > +         */
> > +        if (!IsParent(planstate, node))
> > +            continue;
> >
> > I'm not entirely sure that I understand this comment, but I don't

Sorry for the hard-to-read comment...

> > think it's going in the right direction. Let's start with the example
> > in the second paragraph. If the hash join is async-unaware, then it
> > isn't possible for the hash join to be both running the outer side of
> > the join asynchronously and at the same time waiting on the inner
> > side. Once it tries to pull the first tuple from the outer side, it's
> > waiting for that to finish and can't do anything else. So, the inner
> > side can't possibly get touched in any way until the outer side
> > finishes. For anything else to happen, the hash join would have to be
> > async-aware. Even if we did that, I don't think it would be right to
> > kick off both sides of the hash join at the same time. Right now, if
> > the outer side turns out to be empty, we never need to build the hash
> > table, and that's good.
> >
>
> I feel the !IsParent() condition is actually there to prevent the
> infinite wait caused by a re-entrancy issue in ExecAsyncWaitForNode()
> that Kyotaro mentioned earlier. But yes, the comments don't explain
> exactly how the hash join can cause the re-entrancy issue.
>
> But I attempted to come up with a test case that might reproduce the
> infinite waiting in ExecAsyncWaitForNode() after removing the
> !IsParent() check, so that nodes from other subtrees are also
> included, but I couldn't reproduce it. Kyotaro, is it possible for
> you to give a test case that consistently hangs if we revert the
> !IsParent() check?

I dug out of my memory that it happened during the postgres_fdw
regression test, and it is still reproducible in the same manner.

postgres_fdw> make check
...
============== running regression test queries ==============
test postgres_fdw ... FAILED (test process exited with exit code 2)
...

And in the server log:

== contrib/postgres_fdw/log/postmaster.log
TRAP: FailedAssertion("!(hashtable == ((void *)0))", File: "nodeHashjoin.c", Line: 123)
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection with an open transaction
LOG: server process (PID 9130) was terminated by signal 6: Aborted
DETAIL: Failed process was running: SELECT * FROM ft1 t1 WHERE t1.c3 IN (SELECT c3 FROM ft2 t2 WHERE c1 <= 10) ORDER BY c1;

nodeHashjoin.c:116:
> switch (node->hj_JoinState)
> {
>     case HJ_BUILD_HASHTABLE:
>
>         /*
>          * First time through: build hash table for inner relation.
>          */
>         Assert(hashtable == NULL);

This is ExecHashJoin being reentered while it is still building its
hash table.
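
To illustrate, here is a minimal standalone sketch of the failure
mode. All names are invented for this toy (it is not the actual
executor code); it only shows how dispatching back into a node that is
midway through building its hash table trips an assertion of exactly
this shape:

#include <assert.h>
#include <stddef.h>

/* Toy mock of ExecHashJoin's hj_JoinState handling; invented names,
 * not the actual PostgreSQL code. */
typedef enum {HJ_BUILD_HASHTABLE, HJ_NEED_NEW_OUTER} ToyJoinState;

typedef struct ToyHashJoin
{
    ToyJoinState  state;
    void         *hashtable;        /* NULL until the table is built */
} ToyHashJoin;

static void exec_toy_hash_join(ToyHashJoin *hj);

/* Stand-in for the async wait loop: while the join waits for its
 * inner scan, a ready result is dispatched whose consumer turns out
 * to be the hash join itself. */
static void
wait_for_async_inner(ToyHashJoin *hj)
{
    exec_toy_hash_join(hj);         /* the problematic reverse dispatch */
}

static void
exec_toy_hash_join(ToyHashJoin *hj)
{
    switch (hj->state)
    {
        case HJ_BUILD_HASHTABLE:
            /* First time through, the hash table must not exist yet;
             * on reentry it already does, so this aborts. */
            assert(hj->hashtable == NULL);
            hj->hashtable = (void *) hj;    /* pretend to build it */
            wait_for_async_inner(hj);       /* fill it from the inner */
            hj->state = HJ_NEED_NEW_OUTER;
            break;
        case HJ_NEED_NEW_OUTER:
            break;
    }
}

int
main(void)
{
    ToyHashJoin hj = {HJ_BUILD_HASHTABLE, NULL};

    exec_toy_hash_join(&hj);        /* aborts, mirroring the TRAP above */
    return 0;
}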

Instead, by running installcheck and then connecting to the database
"contrib_regression" after the failure, we can see what plan was being
tried.

contrib_regression=# explain SELECT * FROM ft1 t1 WHERE t1.c3 IN (SELECT c3 FROM ft2 t2 WHERE c1 <= 10) ORDER BY c1;
                                     QUERY PLAN
---------------------------------------------------------------------------------------
 Sort  (cost=275.96..277.21 rows=500 width=47)
   Sort Key: t1.c1
   ->  Hash Join  (cost=208.78..253.54 rows=500 width=47)
         Hash Cond: (t1.c3 = t2.c3)
         ->  Foreign Scan on ft1 t1  (cost=100.00..141.00 rows=1000 width=47)
         ->  Hash  (cost=108.77..108.77 rows=1 width=6)
               ->  HashAggregate  (cost=108.76..108.77 rows=1 width=6)
                     Group Key: t2.c3
                     ->  Foreign Scan on ft2 t2  (cost=100.28..108.73 rows=12 width=6)
(9 rows)

> I was also thinking about another possibility where the same plan
> state node is re-entered, as explained below.
>
> >
> > I don't think it's a good idea to wait for only nodes that are in the
> > current subtree. For example, consider a plan like this:
> >
> > Append
> > -> Foreign Scan on a
> > -> Hash Join
> >   -> Foreign Scan on b
> >   -> Hash
> >     -> Seq Scan on x
> >
> > Suppose Append and Foreign Scan are async-aware but the other nodes
> > are not. Append kicks off the Foreign Scan on a and then waits for
> > the hash join to produce a tuple; the hash join kicks off the Foreign
> > Scan on b and waits for it to return a tuple. If, while we're waiting
> > for the foreign scan on b, the foreign scan on a needs some attention
> > - either to produce tuples, or maybe just to call PQconsumeInput() so
> > that more data can be sent from the other side, I think we need to be
> > able to do that. There's no real problem here; even if the Append
> > becomes result-ready before the hash join returns, that is fine.
>
>
> Yes, I agree: we should be able to do this. Since we have all the
> waiting events in a common estate, there's no harm in starting to
> execute nodes of another sub-tree if we get an event from there.
>
> But I am thinking about what would happen when this node from the
> other sub-tree returns result_ready, its parents are then called, and
> the result gets bubbled up to the node which had already caused us to
> call ExecAsyncWaitForNode() in the first place.
>
> For example, in the above plan which you specified, suppose:
> 1. Hash Join has called ExecProcNode() for the child foreign scan b,
> and so is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
> 2. The event wait list already has the foreign scan on a, which is in
> a different subtree.
> 3. This foreign scan a happens to be ready, so in
> ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is called,
> which returns with result_ready.
> 4. Since it returns result_ready, its parent node is now inserted in
> the callbacks array, and so its parent (Append) is executed.
> 5. But this Append planstate is already in the middle of its
> execution, waiting for the Hash Join.

This is what I wanted to explain with that cryptic comment :(

> Is it safe to execute the same plan state when it is already inside
> its execution? In other words, is the plan state re-entrant? I
> suspect the new execution may even corrupt the structures with which
> it was already executing.

It should be safe in most cases, but HashJoin and some other nodes
keep internal state beyond that of their descendant nodes. Such nodes
cannot be reentered.

> In usual cases, a tree can contain multiple plan state nodes
> belonging to the same plan node, but in this case we have the same
> plan state node being executed again while it is already executing.
>
> I suspect this can be one reason why Kyotaro might be getting
> infinite recursion issues. Maybe we need to prevent a plan state node
> from being re-entered, but allow nodes from any subtree to execute;
> i.e., propagate the result upwards until we reach a node which is
> already executing.

Sorry for the lack of response. The answer is yes: we should be able
to avoid the problem by tracking the execution state of every node,
but that requires an additional flag in the *State structs and
manipulating it while shuttling up and down the execution tree.
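
For what it's worth, a minimal sketch of that idea could look like the
following. The field and function names are hypothetical, not taken
from the posted patches:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-node execution flag for reentrancy tracking;
 * invented names, not from the posted patches. */
typedef struct ToyPlanState
{
    const char           *name;
    bool                  ps_running;   /* true while this node executes */
    struct ToyPlanState  *parent;
} ToyPlanState;

/* Bubble a ready result up the tree, stopping at the first node that
 * is already in the middle of its execution instead of reentering it. */
static void
toy_bubble_up(ToyPlanState *node)
{
    while (node != NULL)
    {
        if (node->ps_running)
        {
            printf("%s is busy; leaving the result pending\n", node->name);
            return;
        }
        node->ps_running = true;
        printf("dispatching to %s\n", node->name);
        /* ... the node would consume the child's result here ... */
        node->ps_running = false;
        node = node->parent;
    }
}

int
main(void)
{
    /* Mirrors the Append example above: Append is busy waiting on its
     * HashJoin child when the ForeignScan on a becomes ready. */
    ToyPlanState append = {"Append", true, NULL};
    ToyPlanState fscan_a = {"ForeignScan on a", false, &append};

    toy_bubble_up(&fscan_a);    /* runs the scan, then stops at Append */
    return 0;
}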

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
