Re: asynchronous and vectorized execution

From: Rajeev rastogi <rajeev(dot)rastogi(at)huawei(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: asynchronous and vectorized execution
Date: 2016-05-10 12:05:23
Message-ID: BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com
Lists: pgsql-hackers

On 09 May 2016 23:04, Robert Haas wrote:

>2. vectorized execution, by which I mean the ability of a node to return
>tuples in batches rather than one by one. Andres has opined more than
>once that repeated trips through ExecProcNode defeat the ability of the
>CPU to do branch prediction correctly, slowing the whole system down,
>and that they also result in poor CPU cache behavior, since we jump all
>over the place executing a little bit of code from each node before
>moving onto the next rather than running one bit of code first, and then
>another later. I think that's
>probably right. For example, consider a 5-table join where all of
>the joins are implemented as hash tables. If this query plan is going
>to be run to completion, it would make much more sense to fetch, say,
>100 tuples from the driving scan and then probe for all of those in the
>first hash table, and then probe for all of those in the second hash
>table, and so on. What we do instead is fetch one tuple and probe for
>it in all 5 hash tables, and then repeat. If one of those hash tables
>would fit in the CPU cache but all five together will not,
>that seems likely to be a lot worse. But even just ignoring the CPU
>cache aspect of it for a minute, suppose you want to write a loop to
>perform a hash join. The inner loop fetches the next tuple from the
>probe table and does a hash lookup. Right now, fetching the next tuple
>from the probe table means calling a function which in turn calls
>another function which probably calls another function which probably
>calls another function and now about 4 layers down we actually get the
>next tuple. If the scan returned a batch of tuples to the hash join,
>fetching the next tuple from the batch would probably be 0 or 1 function
>calls rather than ... more. Admittedly, you've got to consider the cost
>of marshaling the batches but I'm optimistic that there are cycles to be
>squeezed out here. We might also want to consider storing batches of
>tuples in a column-optimized rather than row-optimized format so that
>iterating through one or two attributes across every tuple in the batch
>touches the minimal number of cache lines.
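
As an illustration of the batch interface described above, a batch-at-a-time node API might look roughly like the sketch below. This is only a hypothetical shape: the struct, typedef and function names here are invented for illustration and are not existing PostgreSQL executor APIs.

    #include <stddef.h>

    #define BATCH_SIZE 100

    typedef struct Tuple Tuple;             /* stand-in for a heap/minimal tuple */

    typedef struct TupleBatch
    {
        int    ntuples;                     /* number of tuples filled in */
        Tuple *tuples[BATCH_SIZE];
    } TupleBatch;

    /*
     * A node's "give me the next batch" callback: fills up to BATCH_SIZE
     * tuples per call and returns the count (0 when the input is exhausted).
     * That is one call per ~100 tuples instead of one call, several layers
     * deep, per tuple.
     */
    typedef int (*GetNextBatchFn) (void *node_state, TupleBatch *out);

    /* Consumer side: a hash join probing a whole batch in a tight local loop. */
    static void
    probe_batch(void *hashtable, const TupleBatch *batch)
    {
        for (int i = 0; i < batch->ntuples; i++)
        {
            Tuple *tup = batch->tuples[i];

            /* hash the probe key of tup, look it up, emit any joined tuples */
            (void) hashtable;
            (void) tup;
        }
    }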

This sounds like a really great idea for performance improvement.
I would like to share my thoughts based on our research work in a similar area (it is mostly along the lines you have mentioned).
Our goals with this work were to:
1. Make the processing data-centric instead of operator-centric.
2. Instead of pulling each tuple from its immediate child operator, let an operator push tuples to its parent. Pushing can continue until it reaches an operator that cannot proceed without results from another operator.
3. Use the above two points to achieve better data locality.

For example, we did some quick prototyping (take it just as a thought provoker), as shown below:
Query: select * from tbl1, tbl2, tbl3 where tbl1.a=tbl2.b and tbl2.b=tbl3.c;
For hash join:
    For each tuple t2 of tbl2:
        Insert t2 into the hash table for tbl1.a = tbl2.b (keyed on tbl2.b)

    For each tuple t3 of tbl3:
        Insert t3 into the hash table for tbl2.b = tbl3.c (keyed on tbl3.c)

    For each tuple t1 of tbl1:
        Search the hash table for tbl1.a = tbl2.b
        Search the hash table for tbl2.b = tbl3.c (using the matched t2)
        Output t1*t2*t3
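
A rough C sketch of the same pipeline is below: two build phases followed by a single probe loop that drives both hash tables, so each phase runs a tight code path over its data. The scan/hash/emit helpers are assumed placeholders, not PostgreSQL executor code, and duplicate matches are glossed over for brevity.

    #include <stddef.h>

    typedef struct Tuple Tuple;             /* opaque tuple, placeholder */
    typedef struct HashTable HashTable;     /* opaque hash table, placeholder */

    /* Assumed helper functions (placeholders for real scan/hash machinery). */
    extern Tuple *scan_next(const char *relname);           /* NULL at end */
    extern void   hash_insert(HashTable *ht, int key, Tuple *tup);
    extern Tuple *hash_lookup(HashTable *ht, int key);      /* NULL if no match */
    extern int    get_attr(const Tuple *tup, const char *attname);
    extern void   emit(Tuple *t1, Tuple *t2, Tuple *t3);

    static void
    run_query(HashTable *ht_b, HashTable *ht_c)
    {
        Tuple *t1, *t2, *t3;

        /* Phase 1: build a hash table on tbl2.b (for tbl1.a = tbl2.b). */
        while ((t2 = scan_next("tbl2")) != NULL)
            hash_insert(ht_b, get_attr(t2, "b"), t2);

        /* Phase 2: build a hash table on tbl3.c (for tbl2.b = tbl3.c). */
        while ((t3 = scan_next("tbl3")) != NULL)
            hash_insert(ht_c, get_attr(t3, "c"), t3);

        /* Phase 3: one tight pass over tbl1 probes both hash tables. */
        while ((t1 = scan_next("tbl1")) != NULL)
        {
            t2 = hash_lookup(ht_b, get_attr(t1, "a"));
            if (t2 == NULL)
                continue;

            t3 = hash_lookup(ht_c, get_attr(t2, "b"));
            if (t3 == NULL)
                continue;

            emit(t1, t2, t3);           /* output t1 * t2 * t3 */
        }
    }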

Of course, at each level, if there are any additional quals on the table, they can be applied in the same way.

Similarly, for Nested Loop Join, the plan tree can be processed as a post-order traversal of the tree:
    Scan the first (leftmost) relation and store all its tuples --> Outer
    Loop through the remaining scan-node relations (or some part of their total tuples at a time), starting from the second one:
        Scan the current relation
        For each tuple, match against all tuples of the Outer result and build the combined tuples
        Save all satisfying combined tuples --> Outer
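
Again as a rough, hypothetical C sketch of that traversal: the leftmost relation is materialized as the initial Outer result, and each remaining relation is joined against it in turn, the combined tuples replacing Outer. The list and scan helpers are assumed placeholders; materializing everything is of course only reasonable for a prototype.

    #include <stdbool.h>

    typedef struct TupleList TupleList;     /* a materialized set of tuples, placeholder */

    /* Assumed helper functions (placeholders, not real executor code). */
    extern TupleList *scan_all(const char *relname);
    extern TupleList *list_create(void);
    extern int        list_length(const TupleList *list);
    extern void      *list_nth(const TupleList *list, int n);
    extern bool       qual_matches(void *outer_tup, void *inner_tup);
    extern void       add_combined(TupleList *dst, void *outer_tup, void *inner_tup);

    static TupleList *
    nestloop_postorder(const char *rels[], int nrels)
    {
        /* Scan the first (leftmost) relation; its tuples become "Outer". */
        TupleList *outer = scan_all(rels[0]);

        /* Join each remaining relation against the current Outer result. */
        for (int r = 1; r < nrels; r++)
        {
            TupleList *inner = scan_all(rels[r]);
            TupleList *next = list_create();

            for (int i = 0; i < list_length(outer); i++)
            {
                for (int j = 0; j < list_length(inner); j++)
                {
                    void *ot = list_nth(outer, i);
                    void *it = list_nth(inner, j);

                    if (qual_matches(ot, it))
                        add_combined(next, ot, it);   /* save combined tuple */
                }
            }

            outer = next;               /* combined result becomes the new Outer */
        }

        return outer;
    }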

The result we got was really impressive.

There is a paper by Thomas Neumann on this idea: http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

Note: VitesseDB has also implemented this approach for Hash Join, along with compilation, and their results are really impressive.

Thanks and Regards,
Kumar Rajeev Rastogi.
http://rajeevrastogi.blogspot.in/
