|From:||Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>|
|Subject:||Re: asynchronous and vectorized execution|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
At Tue, 5 Jul 2016 11:39:41 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in <CA+TgmobnQ6ZpsubttBYC=pSLQ6d=0GuSgBsUFoaARMrie_75BA(at)mail(dot)gmail(dot)com>
> On Wed, Jun 29, 2016 at 11:00 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> > We may also want to consider handling abstract events such as
> > "tuples-are-available-at-plan-node-X".
> > One benefit is : we can combine this with batch processing. For e.g. in case
> > of an Append node containing foreign scans, its parent node may not want to
> > process the Append node result until Append is ready with at least 1000
> > rows. In that case, Append node needs to raise an event
> > "n-tuples-are-ready"; we cannot just rely on fd-ready events. fd-ready event
> > will wake up the foreign scan, but it may not eventually cause its parent
> > Append node to in turn wake up it's parent.
> Right, I agree. I think this case only arises in parallel query. In
> serial execution, there's not really any way for a plan node to just
> become ready other than an FD or latch event or the parent becoming
> ready. But in parallel query it can happen, of course, because some
> other backend can do work that makes that node ready to produce
> It's not necessarily the case that we have to deal with this in the
> initial patches for this feature, because the most obvious win for
> this sort of thing is when we have an Append of ForeignScan plans.
> Sure, parallel query has interesting cases, too, but a prototype that
> just handles Append over a bunch of postgres_fdw ForeignScans would be
> pretty cool. I suggest that we make that the initial goal here.
This seems to be a good opportunity to show this patch. The
attched patch set does async execution of foreignscan
(postgres_fdw) on the Robert's first infrastructure, with some
ExecAsyncWaitForNode can get into an inifite-waiting by recursive
calls of ExecAsyncWaitForNode caused by ExecProcNode called from
async-unaware nodes. Such recursive calls cause a wait on
I solved that in the patch set by allocating a separate
async-execution context for every async-execution subtrees, which
is made by ExecProcNode, instead of one async-exec context for
the whole execution tree. This works fine but the way switching
contexts seems ugly. This may also be solved by make
ExecAsyncWaitForNode return when no node to wait even if the
waiting node is not ready. This might keep the async-exec context
(state) simpler so I'll try this.
> > How we can do this event abstraction is the other question. We can have one
> > latch for each of the event, and each node would raise its own event by
> > setting the corresponding latch. But I am not sure about latches within a
> > single process as against one process waking up another process. Or else,
> > some internal event structures needs to be present (in estate ?), which then
> > ExecProcNode would use when it does the event looping to wake up (i.e.
> > execute) required nodes.
> I think adding more latches would be a bad idea. What I think we
> should do instead is add two additional data structures to dynamic
> shared memory:
> 1. An array of PGPROC * pointers for all of the workers. Processes
> add their PGPROC * to this array as they start up. Then, parallel.h
> can expose new API ParallelWorkerSetLatchesForGroup(void). In the
> leader, this sets the latch for every worker process for every
> parallel context with which the leader is associated; in a worker, it
> sets the latch for other processes in the parallel group, and the
> leader also.
> 2. An array of executor nodes where one process might do something
> that allows other processes to make progress on that node. This would
> be set up somehow by execParallel.c, which would need to somehow
> figure out which plan nodes want to be included in the list. When an
> executor node does something that might unblock other workers, it
> calls ParallelWorkerSetLatchesForGroup() and the async stuff then
> tries calling all of the nodes in this array again to see if any of
> them now think that they've got tuples to return (or just to let them
> do additional work without returning tuples).
Does the ParallelWorkerSetLatchesForGroup use mutex or semaphore
or something like instead of latches?
> > Also, the WaitEvent.user_data field can have some more info besides the plan
> > state. It can have its parent PlanState stored, so that we don't have to
> > have parent field in plan state. It also can have some more data such as
> > "n-tuples-available".
> I don't think that works, because execution may need to flow
> arbitrarily far up the tree. Just knowing the immediate parent isn't
> good enough. If it generates a tuple, then you have to in turn call
> it's parent, and that one then produces a tuple, you have to continue
> on even further up the tree. I think it's going to be very awkward to
> make this work without those parent pointers.
Basically agreed, but going up too far was bad for the reason
> BTW, we also need to benchmark those changes to add the parent
> pointers and change the return conventions and see if they have any
> measurable impact on performance.
I have to bring you a bad news.
With the attached patch, an append on four foreign scans on one
server (at local) performs faster by about 10% and by twice for
three or four foreign scns on separate foreign servers
(connections) respectively, but only when compiled with -O0. I
found that it can take hopelessly small amount of advantage from
compiler optimization, while unpatched version gets faster.
Anyway, the current state of this patch is attached.
For binaries compiled with both -O0 and -O2, ran a simple query
"select sum(a) from <table>" on tables generated by the attached
script, t0, pl, pf0 and pf1 which are a local table, an append on
local tables, an append on foreign tables on the same foreign
server and an append on foreign tables on different foreign
servers respectively. The numbers are the mean values of ten
t0 891.3934 18.74902154
pl 416.3298 47.38902802
pf0 13523.0777 87.45769903
pf1 3376.6415 183.3578028
t0 891.4309 5.245807775
pl 408.2932 1.04260004
pf0 13640.3551 52.52211814
pf1 3470.1739 262.3522963
t0 845.3927 18.98379876
pl 363.4933 4.142091341
pf0 14986.1284 23.07288416
pf1 14961.0596 127.2587286
t0 429.8462 31.51970532
pl 176.3959 0.237832551
pf0 8129.3762 44.68774182
pf1 8211.6319 97.93206675
By the way, running the attached testrun.sh, the result for the
first one or two runs of pf0 is faster by about 30%-50% than the
rest for some reason unknown to me...
NTT Open Source Software Center
|Next Message||Michael Paquier||2016-07-06 07:32:39||Re: Password identifiers, protocol aging and SCRAM protocol|
|Previous Message||Michael Paquier||2016-07-06 07:18:07||Re: Password identifiers, protocol aging and SCRAM protocol|