Re: asynchronous and vectorized execution

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: robertmhaas(at)gmail(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: asynchronous and vectorized execution
Date: 2016-05-10 08:50:39
Message-ID: 20160510.175039.122461151.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello.

At Mon, 9 May 2016 13:33:55 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in <CA+Tgmobx8su_bYtAa3DgrqB+R7xZG6kHRj0ccMUUshKAQVftww(at)mail(dot)gmail(dot)com>
> Hi,
>
> I realize that we haven't gotten 9.6beta1 out the door yet, but I
> think we can't really wait much longer to start having at least some
> discussion of 9.7 topics, so I'm going to go ahead and put this one
> out there. I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who). To
> make a long story short, I think there are several different areas
> where we should consider major upgrades to our executor. It's too
> slow and it doesn't do everything we want it to do. The main things
> on my mind are:

> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits. This case most obviously arises for foreign
> tables, where it makes little sense to block on I/O if some other part
> of the query tree could benefit from the CPU; consider SELECT * FROM
> lt WHERE qual UNION SELECT * FROM ft WHERE qual.

This is my main concern and what I wanted to solve.

> It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS. Whether or not this will be efficient is a
> research question, but it can be done. However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer. In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.

Especially for foreign tables, there is necessarily a gap between
sending a FETCH and receiving its result. Visiting other tables is a
very effective way to fill that gap, and waiting on file descriptors
makes this practical, thanks to the new WaitEventSet API. The
attached is a WIP proof of concept of that (sorry for including some
debug code and irrelevant code). Its Exec* APIs differ a bit from
those in the 0002 patch, and it currently works only for postgres_fdw
and Append. It embeds the waiting code into ExecAppend, but that is
easily replaceable with the framework in Robert's 0003 patch.
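
As a concrete (and hedged) illustration, the waiting part could look
roughly like the sketch below. It uses the real WaitEventSet calls
from latch.h, but ExecAsyncGetFD() is a hypothetical placeholder for
however a subnode exposes its socket:

/*
 * Sketch only: block until one of the async subplans' sockets
 * becomes readable, then return that subplan so the caller can
 * resume it.
 */
static PlanState *
WaitForAnyAsyncSubplan(PlanState **asyncnodes, int nasync)
{
    WaitEventSet *wes;
    WaitEvent     event;
    int           i;

    wes = CreateWaitEventSet(CurrentMemoryContext, nasync);
    for (i = 0; i < nasync; i++)
        AddWaitEventToSet(wes, WL_SOCKET_READABLE,
                          ExecAsyncGetFD(asyncnodes[i]), /* hypothetical */
                          NULL, asyncnodes[i]);

    /* timeout -1 means block indefinitely */
    WaitEventSetWait(wes, -1, &event, 1);

    FreeWaitEventSet(wes);
    return (PlanState *) event.user_data;
}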

Apart from the core part: in postgres_fdw, multiple scans can reside
together on one connection. Those scans share the same FD, but there
is no way to identify which scan node the FD was signalled for. To
handle that situation, we may need a 'seemed to be ready but really
not' route.
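
In code, that route might look like the following sketch.
PQconsumeInput() and PQerrorMessage() are the real libpq calls
postgres_fdw already uses; the ownership check and the result_ready
flag follow the PlanState field proposed upthread and are otherwise
illustrative:

/*
 * Sketch only: the socket became readable, but on a shared connection
 * the arriving data may belong to a sibling scan's FETCH.
 */
if (!PQconsumeInput(fsstate->conn))
    elog(ERROR, "%s", PQerrorMessage(fsstate->conn));

if (ConnectionBusyFor(fsstate->conn) != node)  /* hypothetical */
{
    /* Seemed to be ready but really not: yield to the event loop. */
    node->ss.ps.result_ready = false;
    return;
}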

> For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.

One annoyance of this method is that it requires one FD per node,
with latch-like draining of data. Since we would have to provide FDs
for such nodes, Gather might end up with another data-passing channel
over those FDs.

I also want to realize early execution of async nodes. This might
require that all types of node return 'not-ready' on the first call,
even async-capable ones.

> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one. Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior, since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later. I think that's
> probably right. For example, consider a 5-table join where all of
> the joins are implemented as hash tables. If this query plan is going
> to be run to completion, it would make much more sense to fetch, say,
> 100 tuples from the driving scan and then probe for all of those in
> the first hash table, and then probe for all of those in the second
> hash table, and so on. What we do instead is fetch one tuple and
> probe for it in all 5 hash tables, and then repeat. If one of those
> hash tables would fit in the CPU cache but all five together will not,
> that seems likely to be a lot worse. But even just ignoring the CPU
> cache aspect of it for a minute, suppose you want to write a loop to
> perform a hash join. The inner loop fetches the next tuple from the
> probe table and does a hash lookup. Right now, fetching the next
> tuple from the probe table means calling a function which in turn
> calls another function which probably calls another function which
> probably calls another function and now about 4 layers down we
> actually get the next tuple. If the scan returned a batch of tuples
> to the hash join, fetching the next tuple from the batch would
> probably be 0 or 1 function calls rather than ... more. Admittedly,
> you've got to consider the cost of marshaling the batches but I'm
> optimistic that there are cycles to be squeezed out here. We might
> also want to consider storing batches of tuples in a column-optimized
> rather than row-optimized format so that iterating through one or two
> attributes across every tuple in the batch touches the minimal number
> of cache lines.
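
To make the batching idea concrete, a rough sketch might look like
the following. TupleBatch, ExecComputeHash() and ExecProbeBucket()
are all hypothetical names; the point is only the loop structure that
stays inside one hash table for the whole batch:

#define BATCH_SIZE 100

typedef struct TupleBatch
{
    int             ntuples;
    TupleTableSlot *slots[BATCH_SIZE];
} TupleBatch;

/*
 * Sketch only: one pass over the whole batch per hash table keeps
 * that table's buckets hot in the CPU cache.
 */
static void
ProbeBatch(HashJoinTable hashtable, TupleBatch *batch)
{
    int i;

    for (i = 0; i < batch->ntuples; i++)
    {
        uint32  hashvalue;

        hashvalue = ExecComputeHash(hashtable, batch->slots[i]);
        ExecProbeBucket(hashtable, hashvalue, batch->slots[i]);
    }
}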
>
> Obviously, both of these are big projects that could touch a large
> amount of executor code, and there may be other ideas, in addition to
> these, which some of you may be thinking about that could also touch a
> large amount of executor code. It would be nice to agree on a way
> forward that minimizes code churn and maximizes everyone's attempt to
> contribute without conflicting with each other. Also, it seems
> desirable to enable, as far as possible, incremental development - in
> particular, it seems to me that it would be good to pick a design that
> doesn't require massive changes to every node all at once. A single
> patch that adds some capability to every node in the executor in one
> fell swoop is going to be too large to review effectively.
>
> My proposal for how to do this is to make ExecProcNode function as a
> backward-compatibility wrapper. For asynchronous execution, a node
> might return a not-ready-yet indication, but if that node is called
> via ExecProcNode, it means the caller isn't prepared to receive such
> an indication, so ExecProcNode will just wait for the node to become
> ready and then return the tuple. Similarly, for vectorized execution,
> a node might return a bunch of tuples all at once. ExecProcNode will
> extract the first one and return it to the caller, and subsequent
> calls to ExecProcNode will iterate through the rest of the batch, only
> calling the underlying node-specific function when the batch is
> exhausted. In this way, code that doesn't know about the new stuff
> can continue to work pretty much as it does today. Also, and I think
> this is important, nodes don't need the permission of their parent
> node to use these new capabilities. They can use them whenever they
> wish, without worrying about whether the upper node is prepared to
> deal with it. If not, ExecProcNode will paper over the problem. This
> seems to me to be a good way to keep the code simple.

Agreed on returning a not-ready state and on wrapping nodes to
disguise the old-style API, but I suppose Exec* may still return a
tuple as it does currently.
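
For example, the wrapper could be sketched like this, where "result"
and "result_ready" are the PlanState fields proposed upthread and
ExecDispatchNode()/ExecAsyncWaitForNode() are illustrative names:

/*
 * Sketch only: old-style callers get the blocking behavior they
 * expect, while new-style callers can check result_ready themselves.
 */
TupleTableSlot *
ExecProcNode(PlanState *node)
{
    node->result_ready = false;
    ExecDispatchNode(node);     /* the node-type-specific ExecBlah */

    /* Caller cannot handle not-ready: run the event loop until the
     * node produces a result. */
    while (!node->result_ready)
        ExecAsyncWaitForNode(node);

    return (TupleTableSlot *) node->result;
}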

> For asynchronous execution, I have gone so far as to mock up a bit of
> what this might look like. This shouldn't be taken very seriously at
> this point, but I'm attaching a few very-much-WIP patches to show the
> direction of my line of thinking. Basically, I propose to have
> ExecBlah (that is, ExecBitmapHeapScan, ExecAppend, etc.) return tuples
> by putting them into a new PlanState member called "result", which is
> just a Node * so that we can support multiple types of results,
> instead of returning them. There is also a result_ready boolean, so
> that a node can return without setting this Boolean to engage
> asynchronous behavior. This triggers an "event loop", which
> repeatedly waits for FDs chosen by waiting nodes to become readable
> and/or writeable and then gives the node a chance to react.
> Eventually, the waiting node will stop waiting and have a result
> ready, at which point the event loop will give the parent of that node
> a chance to run. If that node consequently becomes ready, then its
> parent gets a chance to run. Eventually (we hope), the node for which
> we're waiting becomes ready, and we can then read a result tuple.

I had thought of almost the same thing, if only for the Append node.
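
For the record, what I had in mind for Append was something like the
sketch below, reusing the WaitForAnyAsyncSubplan() sketch from above
and ignoring subplan exhaustion for brevity:

/*
 * Sketch only: poll every subplan once; if none produced a result,
 * block until some subplan's FD fires, then poll again.
 */
static void
ExecAppendAsync(AppendState *node)
{
    for (;;)
    {
        int i;

        for (i = 0; i < node->as_nplans; i++)
        {
            PlanState *subnode = node->appendplans[i];

            ExecDispatchNode(subnode);  /* illustrative, as above */
            if (subnode->result_ready)
            {
                node->ps.result = subnode->result;
                node->ps.result_ready = true;
                return;
            }
        }
        (void) WaitForAnyAsyncSubplan(node->appendplans,
                                      node->as_nplans);
    }
}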

> With some more work, this seems like it can handle the FDW case, but I
> haven't worked out how to make it handle the related parallel query
> case. What we want there is to wait not for the readiness of an FD
> but rather for some other process involved in the parallel query to
> reach a point where it can welcome assistance executing that node. I
> don't know exactly what the signaling for that should look like yet -
> maybe setting the process latch or something.

Agreed as described above.
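
For that, the existing latch machinery already seems sufficient. As a
sketch (the surrounding usage is illustrative, but the calls
themselves are the real latch API):

/* Worker side: poke the leader when assistance would be welcome. */
SetLatch(&leader_proc->procLatch);

/* Leader side: include the latch in the same WaitEventSet as the
 * sockets, so either kind of event wakes the event loop. */
AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);

/* after WaitEventSetWait() reports WL_LATCH_SET: */
ResetLatch(MyLatch);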

> By the way, one smaller executor project that I think we should also
> look at has to do with this comment in nodeSeqScan.c:
>
> static bool
> SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
> {
> /*
> * Note that unlike IndexScan, SeqScan never use keys in heap_beginscan
> * (and this is very bad) - so, here we do not check are keys ok or not.
> */
> return true;
> }
>
> Some quick prototyping by my colleague Dilip Kumar suggests that, in
> fact, there are cases where pushing down keys into heap_beginscan()
> could be significantly faster. Some care is required here because any
> functions we execute as scan keys are run with the buffer locked, so
> we had better not run anything very complicated. But doing this for
> simple things like integer equality operators seems like it could save
> quite a few buffer lock/unlock cycles and some other executor overhead
> as well.

The cost of pushing down keys into seqscans seems calculable with a
fairly small amount of computation, so I suppose it is promising.
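
For reference, pushing a simple integer-equality key down would use
the existing ScanKey machinery, roughly like this (the calls exist
today; only the surrounding usage is illustrative):

ScanKeyData skey;

ScanKeyInit(&skey,
            attnum,                 /* target column's attnum */
            BTEqualStrategyNumber,  /* strategy; heap scans ignore it */
            F_INT4EQ,               /* int4 equality comparison proc */
            Int32GetDatum(42));     /* constant to compare against */

scan = heap_beginscan(rel, snapshot, 1, &skey);

With the key supplied, heapgettup() filters tuples with HeapKeyTest()
while the buffer lock is already held, instead of handing every tuple
back up to the executor only to have it discarded there.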

> Thoughts, ideas, suggestions, etc. very welcome.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
PoC-async-exec-horiguchi-20160510.diff text/x-patch 39.3 KB
