Re: Parallel Hash take II

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Golovanov <rentech(at)mail(dot)ru>
Subject: Re: Parallel Hash take II
Date: 2017-10-31 04:02:37
Message-ID: CAEepm=2158-7OJ-kHe+p4bqFkwmx1wOr7BiwqHBHCEBMtLbjbQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 1, 2017 at 9:28 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2017-07-26 20:12:56 +1200, Thomas Munro wrote:
>> I'll report on performance separately.
>
> Looking forward to that ;)

Here are some experimental results from a Xeon E5-2695 v3 with a ton
of RAM and spinning disks (EDB lab server "scylla"). I used TPC-H
dbgen scale 10 with the additional indexes suggested by Tomas
Vondra[1]. 10GB of source data (= 23GB pgdata dir) is obviously quite
small as these things go, and I hope we'll run some of these with a
much larger scale soon (it's a big job), but it's enough to run
queries for tens of seconds to minutes, so it's definitely in parallel
query territory and shows some pretty interesting effects IMHO.

First, here's a stupid self-join as a warm-up. The query is SELECT
COUNT(*) FROM lineitem r JOIN lineitem s USING (l_orderkey,
l_linenumber), where lineitem is ~60 million rows.
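
Written out as a runnable statement, that's simply:

  -- self-join on lineitem's (l_orderkey, l_linenumber) key columns
  SELECT COUNT(*)
  FROM lineitem r
  JOIN lineitem s USING (l_orderkey, l_linenumber);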

(1) With work_mem set sky-high so no batching is required, how much
speed-up does each worker contribute with this feature off (= same as
unpatched master) and on? In this table, each cell shows the speed-up
compared to w=0 (no workers):

parallel_hash | w=0 | w=1 | w=2 | w=3 | w=4 | w=5 | w=6
---------------+-------+-------+-------+-------+-------+-------+-------
off | 1.00x | 1.42x | 1.66x | 1.76x | 1.66x | 1.68x | 1.69x
on | 1.00x | 1.76x | 2.54x | 3.21x | 3.80x | 4.30x | 4.79x
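
(In case anyone wants to reproduce a cell from the table above, the
knobs involved are roughly the following -- illustrative values only,
since the exact "sky-high" work_mem isn't shown here:)

  SET enable_parallel_hash = off;           -- off = first row, on = second row
  SET max_parallel_workers_per_gather = 4;  -- caps the worker count w
  SET work_mem = '20GB';                    -- high enough that the Hash shows Batches: 1
  EXPLAIN ANALYZE
  SELECT COUNT(*)
  FROM lineitem r
  JOIN lineitem s USING (l_orderkey, l_linenumber);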

(2) With work_mem set to 128MB, this query needs 32 batches. Again,
how much speed-up does each worker contribute with the feature off and
on?

parallel_hash | w=0 | w=1 | w=2 | w=3 | w=4 | w=5 | w=6
---------------+-------+-------+-------+-------+-------+-------+-------
off | 1.00x | 1.37x | 1.49x | 1.32x | 1.67x | 1.63x | 1.64x
on | 1.00x | 1.94x | 2.72x | 2.82x | 3.02x | 4.64x | 5.98x
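
(If you want to see where the 32 comes from, the Hash node's EXPLAIN
ANALYZE output reports it; look for a line like "Buckets: ...
Batches: 32 Memory Usage: ...":)

  SET work_mem = '128MB';
  EXPLAIN (ANALYZE, COSTS OFF)
  SELECT COUNT(*)
  FROM lineitem r
  JOIN lineitem s USING (l_orderkey, l_linenumber);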

I haven't tried to grok the shape of that curve yet. Interestingly
(not shown here) the 32-batch parallel hash actually starts to beat
the single-batch parallel hash somewhere around 5-6 workers, and at 15
workers it achieves 9.53x speed-up compared to w=0 and is still
gaining as you add more workers, whereas the single-batch version tops
out around 8 workers. This may be in part due to the trade-offs
discussed in "Design and Evaluation of Main Memory Hash Join
Algorithms for Multi-core CPUs" (short version: partitioning up front
can pay off by reducing cache misses at various levels and some
research databases would consider that), but I would think we're
probably pretty far away from that frontier and there are probably
other more basic problems. Investigation/profiling required.

Next, here are some numbers from the TPC-H queries. I included only
queries where a Parallel Hash was selected by the planner. I stopped
at w=6 because that's the highest number of workers the planner would
pick by default at that scale. (If I'm getting the maths right, TPC-H
scale 300's lineitem table should inspire about 10 workers to get out
of bed; you get an extra worker each time a table triples in size.)
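
(Spelling that rule out: with the default min_parallel_table_scan_size
of 8MB and ignoring the various caps, it's roughly workers = 1 +
floor(log3(table size / 8MB)). For an illustrative ~200GB table:)

  -- illustrative arithmetic only; 200GB is a made-up round number
  SELECT 1 + floor(log(3.0, (200 * 1024) / 8.0)) AS approx_workers;  -- => 10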

(3) What is the speed-up with enable_parallel_hash = on vs
enable_parallel_hash = off? Here is the result table for various
numbers of workers, with work_mem set to 1GB.

query | w=0 | w=1 | w=2 | w=3 | w=4 | w=5 | w=6
-------+-------+-------+-------+-------+-------+-------+-------
3 | 1.02x | 1.16x | 1.37x | 1.79x | 1.95x | 2.29x | 2.44x
5 | 1.03x | 1.15x | 1.20x | 1.44x | 1.95x | 2.05x | 1.34x
7 | 1.02x | 1.26x | 1.54x | 2.18x | 2.57x | 1.25x | 1.32x
8 | 1.00x | 1.56x | 1.49x | 1.47x | 1.40x | 0.55x | 0.55x
9 | 1.02x | 1.24x | 1.35x | 1.50x | 1.59x | 1.82x | 1.82x
10 | 1.02x | 1.16x | 1.19x | 1.44x | 1.51x | 1.75x | 1.83x
12 | 1.01x | 1.22x | 1.53x | 0.72x | 0.74x | 0.73x | 0.99x
14 | 1.00x | 1.08x | 1.18x | 1.33x | 1.41x | 1.54x | 1.52x
16 | 1.01x | 1.22x | 1.10x | 1.10x | 1.10x | 1.11x | 1.10x
18 | 0.99x | 1.07x | 1.05x | 0.99x | 0.99x | 0.99x | 1.03x
21 | 1.00x | 1.28x | 1.24x | 1.34x | 0.18x | 0.19x | 0.23x

Some commentary on the cases where the feature apparently hurt
performance: for Q21 with w=3 workers and above, with
enable_parallel_hash = off the planner switched from a hash join to a
nested loop and that turned out to be better, but with
enable_parallel_hash = on it didn't give up on hash join until there
were 6 workers. Something similar happened with Q8 around 5 workers.
Q21 has some major cardinality estimation problems as discussed
elsewhere, and on this run I didn't think to apply the patch that
fixes (?) that. In other words, as far as I can tell, all of those
are cases where there is possibly room for general planner improvement
outside this project: the point at which we flip from one plan type to
another moves around, not necessarily indicating a problem with
Parallel Hash as an executor node. That isn't to say I'm not
interested in understanding the causes better and trying to fix them
if I can.

(4) The same comparison, with work_mem set to 128MB resulting in more batching:

query | w=0 | w=1 | w=2 | w=3 | w=4 | w=5 | w=6
-------+-------+-------+-------+-------+-------+-------+-------
3 | 1.03x | 1.23x | 1.44x | 1.76x | 1.97x | 2.23x | 2.55x
5 | 1.01x | 1.07x | 1.25x | 1.44x | 1.79x | 2.05x | 1.31x
7 | 1.02x | 1.42x | 1.73x | 1.26x | 1.20x | 1.28x | 1.33x
8 | 1.01x | 1.57x | 1.51x | 1.49x | 1.41x | 0.55x | 0.52x
9 | 0.99x | 1.14x | 1.43x | 1.56x | 1.82x | 1.96x | 2.06x
10 | 1.02x | 1.08x | 1.24x | 1.38x | 1.51x | 1.54x | 1.65x
12 | 1.02x | 1.02x | 0.71x | 0.73x | 0.73x | 0.99x | 0.99x
14 | 1.03x | 1.06x | 1.19x | 1.37x | 1.59x | 1.58x | 1.59x
16 | 1.00x | 1.21x | 1.10x | 1.09x | 1.13x | 1.12x | 1.12x
18 | 0.98x | 1.22x | 1.28x | 1.21x | 1.10x | 0.98x | 0.95x
21 | 0.96x | 1.25x | 1.56x | 0.41x | 0.41x | 0.87x | 1.18x

Similar, with the inflection points moving around a bit.

(5) Another way to look at the data is to see how much speed-up each
new worker gives you with and without this feature, as I did for the
self-join above. In this table, there are two lines for each query.
The first line shows the speed-up as we add more workers with
enable_parallel_hash = off (should be the same as unpatched master),
and the second line shows the same with enable_parallel_hash = on.

query | w=0 | w=1 | w=2 | w=3 | w=4 | w=5 | w=6
-------+-------+-------+-------+-------+-------+-------+-------
3 | 1.00x | 1.60x | 2.00x | 2.07x | 2.27x | 2.23x | 2.39x
3 | 1.00x | 1.83x | 2.68x | 3.64x | 4.35x | 5.02x | 5.72x
-------+-------+-------+-------+-------+-------+-------+-------
5 | 1.00x | 1.58x | 2.14x | 2.36x | 2.30x | 2.57x | 8.68x
5 | 1.00x | 1.75x | 2.49x | 3.29x | 4.34x | 5.09x | 11.28x
-------+-------+-------+-------+-------+-------+-------+-------
7 | 1.00x | 1.44x | 1.75x | 1.61x | 1.67x | 4.02x | 4.35x
7 | 1.00x | 1.78x | 2.66x | 3.44x | 4.24x | 4.93x | 5.64x
-------+-------+-------+-------+-------+-------+-------+-------
8 | 1.00x | 1.19x | 1.28x | 1.31x | 1.36x | 3.30x | 3.34x
8 | 1.00x | 1.85x | 1.90x | 1.93x | 1.91x | 1.82x | 1.85x
-------+-------+-------+-------+-------+-------+-------+-------
9 | 1.00x | 1.59x | 2.19x | 2.52x | 2.81x | 2.76x | 2.74x
9 | 1.00x | 1.94x | 2.88x | 3.69x | 4.38x | 4.92x | 4.89x
-------+-------+-------+-------+-------+-------+-------+-------
10 | 1.00x | 1.45x | 1.92x | 2.19x | 2.36x | 2.28x | 2.49x
10 | 1.00x | 1.65x | 2.25x | 3.10x | 3.48x | 3.91x | 4.48x
-------+-------+-------+-------+-------+-------+-------+-------
12 | 1.00x | 1.50x | 1.76x | 4.71x | 5.66x | 6.61x | 7.61x
12 | 1.00x | 1.81x | 2.65x | 3.36x | 4.14x | 4.78x | 7.43x
-------+-------+-------+-------+-------+-------+-------+-------
14 | 1.00x | 1.40x | 1.68x | 1.86x | 1.97x | 1.95x | 1.95x
14 | 1.00x | 1.50x | 1.98x | 2.47x | 2.76x | 2.98x | 2.95x
-------+-------+-------+-------+-------+-------+-------+-------
16 | 1.00x | 1.01x | 1.25x | 1.31x | 1.35x | 1.38x | 1.39x
16 | 1.00x | 1.22x | 1.36x | 1.43x | 1.47x | 1.53x | 1.53x
-------+-------+-------+-------+-------+-------+-------+-------
18 | 1.00x | 0.86x | 0.93x | 1.08x | 1.11x | 1.22x | 1.15x
18 | 1.00x | 0.93x | 0.98x | 1.08x | 1.11x | 1.22x | 1.20x
-------+-------+-------+-------+-------+-------+-------+-------
21 | 1.00x | 1.12x | 0.49x | 0.59x | 5.10x | 5.67x | 5.18x
21 | 1.00x | 1.44x | 0.62x | 0.80x | 0.95x | 1.08x | 1.22x

[1] https://github.com/tvondra/pg_tpch/blob/master/dss/tpch-index.sql

--
Thomas Munro
http://www.enterprisedb.com
