Re: WIP: [[Parallel] Shared] Hash

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: [[Parallel] Shared] Hash
Date: 2017-01-29 13:52:17
Message-ID: CAEepm=1YGGXH5iDf0eivC7FQrxKgn5ERqubeHq7hNPHKNXkcpQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jan 7, 2017 at 9:01 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> Stepping back a bit, I am aware of the following approaches to hash
> join parallelism:
>
> 1. Run the inner plan and build a private hash table in each
> participant [...].
>
> 2. Run a partition-wise hash join[1]. [...]
>
> 3. Repartition the data on the fly, and then run a partition-wise
> hash join. [...]
>
> 4. Scatter both the inner and outer plans arbitrarily across
> participants [...], and build a shared hash
> table. [...]
>
> [...] I suspect that 4 is probably a better
> fit than 3 for Postgres today, because the communication overhead of
> shovelling nearly all tuples through extra tuple queues to route them
> to the right hash table would surely be very high, though I can see
> that it's very attractive to have a reusable tuple repartitioning
> operator and then run k disjoint communication-free joins (again,
> without code change to the join operator, and to the benefit of all
> join operators).

On this topic, I recently stumbled on the 2011 paper "Design and
Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs"[1]
and found it reassuring. It compares simple shared hash tables to
some state-of-the-art repartitioning approaches (including the radix
join algorithm which performs the amazing feat of building a lot of
cacheline-sized hash tables and then runs with minimal cache misses).

From the introduction:

"Second, we show that an algorithm that does not do any partitioning,
but simply constructs a single shared hash table on the build relation
often outperforms more complex algorithms. This simple
“no-partitioning” hash join algorithm is robust to sub-optimal
parameter choices by the optimizer, and does not require any knowledge
of the characteristics of the input to work well. To the best of our
knowledge, this simple hash join technique differs from what is
currently implemented in existing DBMSs for multi-core hash join
processing, and offers a tantalizingly simple, efficient, and robust
technique for implementing the hash join operation."

"Finally, we show that the simple “no-partitioning” hash join
algorithm takes advantage of intrinsic hardware optimizations to
handle skew. As a result, this simple hash join technique often
benefits from skew and its relative performance increases as the skew
increases! This property is a big advancement over the
state-of-the-art methods, as it is important to have methods that can
gracefully handle skew in practice [8]."

(That relates to SMT pipelining compensating for the extra cacheline
misses during probing by doing thread A's work while waiting for
thread B's memory to be fetched.)

From the conclusion:

"... Our results show that a simple hash join technique that does not
do any partitioning of the input relations often outperforms the other
more complex partitioning-based join alternatives. In addition, the
relative performance of this simple hash join technique rapidly
improves with increasing skew, and it outperforms every other
algorithm in the presence of even small amounts of skew."

For balance, the authors of a 2013 paper "Main-Memory Hash Joins on
Multi-Core CPUs: Tuning to the Underlying Hardware"[2] are less keen
on the simple "hardware-oblivious" "no partitioning" approach and
don't buy the other paper's ideas about SMT. Incidentally, their
results on the benefits of large (huge) pages are interesting (table
VI) and suggest that huge page support for DSM segments could be good
here.

[1] https://pdfs.semanticscholar.org/9de4/b32f2c7b630a4f6aae6994a362a46c7c49e9.pdf
[2] https://www.inf.ethz.ch/personal/cagri.balkesen/publications/parallel-joins-icde13.pdf

--
Thomas Munro
http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-01-29 14:00:42 Re: Should buffer of initialization fork have a BM_PERMANENT flag
Previous Message Mithun Cy 2017-01-29 09:13:47 Re: Cache Hash Index meta page.