Re: asynchronous and vectorized execution

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: robertmhaas(at)gmail(dot)com
Cc: amitdkhan(dot)pg(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: asynchronous and vectorized execution
Date: 2016-07-06 07:29:14
Message-ID: 20160706.162914.131241734.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hello,

At Tue, 5 Jul 2016 11:39:41 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in <CA+TgmobnQ6ZpsubttBYC=pSLQ6d=0GuSgBsUFoaARMrie_75BA(at)mail(dot)gmail(dot)com>
> On Wed, Jun 29, 2016 at 11:00 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> > We may also want to consider handling abstract events such as
> > "tuples-are-available-at-plan-node-X".
> >
> > One benefit is : we can combine this with batch processing. For e.g. in case
> > of an Append node containing foreign scans, its parent node may not want to
> > process the Append node result until Append is ready with at least 1000
> > rows. In that case, Append node needs to raise an event
> > "n-tuples-are-ready"; we cannot just rely on fd-ready events. fd-ready event
> > will wake up the foreign scan, but it may not eventually cause its parent
> > Append node to in turn wake up it's parent.
>
> Right, I agree. I think this case only arises in parallel query. In
> serial execution, there's not really any way for a plan node to just
> become ready other than an FD or latch event or the parent becoming
> ready. But in parallel query it can happen, of course, because some
> other backend can do work that makes that node ready to produce
> tuples.
>
> It's not necessarily the case that we have to deal with this in the
> initial patches for this feature, because the most obvious win for
> this sort of thing is when we have an Append of ForeignScan plans.
> Sure, parallel query has interesting cases, too, but a prototype that
> just handles Append over a bunch of postgres_fdw ForeignScans would be
> pretty cool. I suggest that we make that the initial goal here.

This seems to be a good opportunity to show this patch. The
attched patch set does async execution of foreignscan
(postgres_fdw) on the Robert's first infrastructure, with some
modification.

ExecAsyncWaitForNode can get into an inifite-waiting by recursive
calls of ExecAsyncWaitForNode caused by ExecProcNode called from
async-unaware nodes. Such recursive calls cause a wait on
already-ready nodes.

I solved that in the patch set by allocating a separate
async-execution context for every async-execution subtrees, which
is made by ExecProcNode, instead of one async-exec context for
the whole execution tree. This works fine but the way switching
contexts seems ugly. This may also be solved by make
ExecAsyncWaitForNode return when no node to wait even if the
waiting node is not ready. This might keep the async-exec context
(state) simpler so I'll try this.

> > How we can do this event abstraction is the other question. We can have one
> > latch for each of the event, and each node would raise its own event by
> > setting the corresponding latch. But I am not sure about latches within a
> > single process as against one process waking up another process. Or else,
> > some internal event structures needs to be present (in estate ?), which then
> > ExecProcNode would use when it does the event looping to wake up (i.e.
> > execute) required nodes.
>
> I think adding more latches would be a bad idea. What I think we
> should do instead is add two additional data structures to dynamic
> shared memory:
>
> 1. An array of PGPROC * pointers for all of the workers. Processes
> add their PGPROC * to this array as they start up. Then, parallel.h
> can expose new API ParallelWorkerSetLatchesForGroup(void). In the
> leader, this sets the latch for every worker process for every
> parallel context with which the leader is associated; in a worker, it
> sets the latch for other processes in the parallel group, and the
> leader also.
>
> 2. An array of executor nodes where one process might do something
> that allows other processes to make progress on that node. This would
> be set up somehow by execParallel.c, which would need to somehow
> figure out which plan nodes want to be included in the list. When an
> executor node does something that might unblock other workers, it
> calls ParallelWorkerSetLatchesForGroup() and the async stuff then
> tries calling all of the nodes in this array again to see if any of
> them now think that they've got tuples to return (or just to let them
> do additional work without returning tuples).

Does the ParallelWorkerSetLatchesForGroup use mutex or semaphore
or something like instead of latches?

> > Also, the WaitEvent.user_data field can have some more info besides the plan
> > state. It can have its parent PlanState stored, so that we don't have to
> > have parent field in plan state. It also can have some more data such as
> > "n-tuples-available".
>
> I don't think that works, because execution may need to flow
> arbitrarily far up the tree. Just knowing the immediate parent isn't
> good enough. If it generates a tuple, then you have to in turn call
> it's parent, and that one then produces a tuple, you have to continue
> on even further up the tree. I think it's going to be very awkward to
> make this work without those parent pointers.

Basically agreed, but going up too far was bad for the reason
above.

> BTW, we also need to benchmark those changes to add the parent
> pointers and change the return conventions and see if they have any
> measurable impact on performance.

I have to bring you a bad news.

With the attached patch, an append on four foreign scans on one
server (at local) performs faster by about 10% and by twice for
three or four foreign scns on separate foreign servers
(connections) respectively, but only when compiled with -O0. I
found that it can take hopelessly small amount of advantage from
compiler optimization, while unpatched version gets faster.

Anyway, the current state of this patch is attached.

For binaries compiled with both -O0 and -O2, ran a simple query
"select sum(a) from <table>" on tables generated by the attached
script, t0, pl, pf0 and pf1 which are a local table, an append on
local tables, an append on foreign tables on the same foreign
server and an append on foreign tables on different foreign
servers respectively. The numbers are the mean values of ten
times run.

average(ms) stddev
patched-O0
t0 891.3934 18.74902154
pl 416.3298 47.38902802
pf0 13523.0777 87.45769903
pf1 3376.6415 183.3578028

patched-O2:
t0 891.4309 5.245807775
pl 408.2932 1.04260004
pf0 13640.3551 52.52211814
pf1 3470.1739 262.3522963

not-patched-O0
t0 845.3927 18.98379876
pl 363.4933 4.142091341
pf0 14986.1284 23.07288416
pf1 14961.0596 127.2587286

not-patched-O2
t0 429.8462 31.51970532
pl 176.3959 0.237832551
pf0 8129.3762 44.68774182
pf1 8211.6319 97.93206675

By the way, running the attached testrun.sh, the result for the
first one or two runs of pf0 is faster by about 30%-50% than the
rest for some reason unknown to me...

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Modify-PlanState-to-include-a-pointer-to-the-parent-.patch text/x-patch 21.4 KB
0002-Modify-PlanState-to-have-result-result_ready-fields..patch text/x-patch 67.4 KB
0003-Lightweight-framework-for-waiting-for-events.patch text/x-patch 15.2 KB
0004-Fix-async-execution-framework.patch text/x-patch 17.9 KB
0005-Add-new-fdwroutine-AsyncConfigureWait-and-ShutdownFo.patch text/x-patch 3.6 KB
0006-Make-postgres_fdw-async-capable.patch text/x-patch 40.6 KB
0007-Make-Append-node-async-aware.patch text/x-patch 5.2 KB
unknown_filename text/plain 2.3 KB
unknown_filename text/plain 250 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-07-06 07:32:39 Re: Password identifiers, protocol aging and SCRAM protocol
Previous Message Michael Paquier 2016-07-06 07:18:07 Re: Password identifiers, protocol aging and SCRAM protocol