Re: [PoC] Asynchronous execution again (which is not parallel)

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: robertmhaas(at)gmail(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PoC] Asynchronous execution again (which is not parallel)
Date: 2015-12-21 05:07:36
Message-ID: 20151221.140736.196135411.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Thank you for the comment.

At Tue, 15 Dec 2015 21:01:27 -0500, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in <CA+TgmoZuAqVDJQ14YHCa3izbdaaaUSuwrG1YbtJD0rKO5EmeKQ(at)mail(dot)gmail(dot)com>
> On Mon, Dec 14, 2015 at 3:34 AM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > Yes, the most significant and obvious (but hard to estimate the
> > benefit) target of async execution is (Merge)Append-ForeignScan,
> > which is narrow but frequently used. And this patch has started
> > from it.
> >
> > It is because of the startup-heavy nature of FDW. So I involved
> > sort as a target later, then redesigned to give the ability on all
> > nodes. If it is obviously over-done for the (currently) expected
> > benefit and if it is preferable to shrink this patch so as to
> > touch only the portion where async-exec has a benefit, I'll do
> > so.
>
> Suppose we equip each EState with the ability to fire "callbacks".
> Callbacks have the signature:
>
> typedef bool (*ExecCallback)(PlanState *planstate, TupleTableSlot
> *slot, void *context);
>
> Executor nodes can register immediate callbacks to be run at the
> earliest possible opportunity using a function like
> ExecRegisterCallback(estate, callback, planstate, slot, context).
> They can register deferred callbacks that will be called when a file
> descriptor becomes ready for I/O, or when the process latch is set,
> using a call like ExecRegisterFileCallback(estate, fd, event,
> callback, planstate, slot, context) or
> ExecRegisterLatchCallback(estate, callback, planstate, slot, context).
>
> To execute callbacks, an executor node can call ExecFireCallbacks(),
> which will fire immediate callbacks in order of registration, and wait
> for the file descriptors for which callbacks have been registered and
> for the process latch when no immediate callbacks remain but there are
> still deferred callbacks. It will return when (1) there are no
> remaining immediate or deferred callbacks or (2) one of the callbacks
> returns "true".

Excellent! I had unconsciously excluded the callback approach because
I supposed (without firm grounds) that all executor nodes could have a
chance to win from this. Such a callback is a good fit for doing what
Start*Node did in the latest patch.
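
For my own understanding, here is a minimal sketch of the interface
as I read it (everything below is an assumption drawn from your
description, not existing code):

  typedef bool (*ExecCallback) (PlanState *planstate,
                                TupleTableSlot *slot, void *context);

  /* immediate: fired at the earliest opportunity, in registration order */
  extern void ExecRegisterCallback(EState *estate, ExecCallback callback,
                                   PlanState *planstate,
                                   TupleTableSlot *slot, void *context);

  /* deferred: fired when fd becomes ready for the requested event */
  extern void ExecRegisterFileCallback(EState *estate, int fd, int event,
                                       ExecCallback callback,
                                       PlanState *planstate,
                                       TupleTableSlot *slot, void *context);

  /* deferred: fired when the process latch is set */
  extern void ExecRegisterLatchCallback(EState *estate, ExecCallback callback,
                                        PlanState *planstate,
                                        TupleTableSlot *slot, void *context);

  /* returns when no callbacks remain or when one of them returns true */
  extern void ExecFireCallbacks(EState *estate);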

> Then, suppose we add a function bool ExecStartAsync(PlanState *target,
> ExecCallback callback, PlanState *cb_planstate, void *cb_context).
> For non-async-aware plan nodes, this just returns false. Async-aware
> plan nodes should initiate some work, register some callbacks, and
> return. The callbacks that get registered should arrange in turn to
> register the callback passed as an argument when a tuple becomes
> available, passing the planstate and context provided by
> ExecStartAsync's caller, plus the TupleTableSlot containing the tuple.

Although I don't have a clear picture of the case of async-aware
nodes under non-aware nodes, it seems to have a high affinity with
the (true) parallel execution framework.
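
Just to fix the idea in code, a rough sketch of the entry point (the
node dispatch and ExecAsyncForeignScanStart() below are my own
inventions for illustration, not part of your proposal):

  bool
  ExecStartAsync(PlanState *target, ExecCallback callback,
                 PlanState *cb_planstate, void *cb_context)
  {
      switch (nodeTag(target))
      {
          /*
           * Async-aware node types initiate work, save the caller's
           * callback in their own state, register their deferred
           * callbacks, and return true.
           */
          case T_ForeignScanState:
              return ExecAsyncForeignScanStart((ForeignScanState *) target,
                                               callback, cb_planstate,
                                               cb_context);  /* hypothetical */

          /* everything else declines and is executed synchronously */
          default:
              return false;
      }
  }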

> So, in response to ExecStartAsync, if there's no tuple currently
> available, postgres_fdw can send a query to the remote server and
> request a callback when the fd becomes read-ready. It must save the
> callback passed to ExecStartAsync inside the PlanState someplace so
> that when a tuple becomes available it can register that callback.
>
> ExecAppend can call ExecStartAsync on each of its subplans. For any
> subplan where ExecStartAsync returns false, ExecAppend will just
> execute it normally, by calling ExecProcNode repeatedly until no more
> tuples are returned. But for async-capable subplans, it can call
> ExecStartAsync on all of them, and then call ExecFireCallbacks. The
> tuple-ready callback it passes to its child plans will take the tuple
> provided by the child plan and store it into the Append node's slot.
> It will then return true if, and only if, ExecFireCallbacks is being
> invoked from ExecAppend (which it can figure out via some kind of
> signalling either through its own PlanState or centralized signalling
> through the EState). That way, if ExecAppend were itself invoked
> asynchronously, its tuple-ready callback could simply populate a slot
> appropriately and register its invoker's tuple-ready callback. Whether
> called synchronously or asynchronously, each invocation of an
> asynchronous append after the first would just need to again call
> ExecStartAsync on the child that last returned a tuple.

Thanks for the attentive explanation. My concern about this is the
latency caused by synchronizing the producer and the consumer for
every single tuple. My previous patch is not asynchronous on a
per-tuple basis, so it gives a pure gain without any loss from
tuple-wise synchronization. But this design looks clean and I like
it, so I'll consider it.
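
In other words, I understand the Append side roughly as follows
(pseudo-code; the field names as_result_slot and as_fired_from_append
are made up for illustration):

  static bool
  append_tuple_ready(PlanState *planstate, TupleTableSlot *slot,
                     void *context)
  {
      AppendState *node = (AppendState *) planstate;

      node->as_result_slot = slot;    /* keep the child's tuple */

      /* true only while ExecFireCallbacks() is run from ExecAppend */
      return node->as_fired_from_append;
  }

  /* in ExecAppend: */
  foreach subplan
      if (!ExecStartAsync(subplan, append_tuple_ready,
                          (PlanState *) node, NULL))
          run the subplan synchronously with ExecProcNode();
  node->as_fired_from_append = true;
  ExecFireCallbacks(node->ps.state);
  node->as_fired_from_append = false;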

> It seems pretty straightforward to fit Gather into this infrastructure.

Yes.

> It is unclear to me how useful this is beyond ForeignScan, Gather, and
> Append. MergeAppend's ordering constraint makes it less useful; we
> can asynchronously kick off the request for the next tuple before
> returning the previous one, but we're going to need to have that tuple
> before we can return the next one. But it could be done. It could
> potentially even be applied to seq scans or index scans using some set
> of asynchronous I/O interfaces, but I don't see how it could be
> applied to joins or aggregates, which typically can't really proceed
> until they get the next tuple. They could be plugged into this
> interface easily enough but it would only help to the extent that it
> enabled asynchrony elsewhere in the plan tree to be pulled up towards
> the root.

This is an argument mainly not about "asynchronous execution/start"
but about "asynchronous tuple-passing". As I showed before, a merge
join on asynchronous and parallel children running sorts *can* win
over a hash join (if the planner foresees that). If asynchronous
tuple-passing is not very effective, as in the MergeAppend case, we
can simply refrain from doing it. But cost modeling for it is a
difficult problem.
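
If we did apply it to MergeAppend, I suppose the pattern would be
roughly the following (pseudo-code, for illustration only):

  /* when handing a tuple taken from child i back to the parent: */
  ExecStartAsync(child[i], mergeappend_tuple_ready,
                 (PlanState *) node, NULL);   /* prefetch the next tuple */
  return the current tuple;
  /* on the next call, ExecFireCallbacks() waits until child[i]'s next
   * tuple arrives, then the binary heap is adjusted as usual */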

> Thoughts?

I'll try the callback framework and in-process asynchronous
tuple-passing (like select(2)). Please wait for a while.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
