Re: [HACKERS] [POC] Faster processing at Gather node

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [POC] Faster processing at Gather node
Date: 2017-11-15 18:48:18
Message-ID: CA+Tgmoa5azAJEQLHhCAfdyw+N7qJWTBm3DVGT7aqA94-r9Cmaw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 14, 2017 at 7:31 AM, Rafia Sabih
<rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
> Case 2: patches applied as in case 1 +
> a) increased PARALLEL_TUPLE_QUEUE_SIZE to 655360
> No significant change in performance in any query
> b) increased PARALLEL_TUPLE_QUEUE_SIZE to 65536 * 50
> Performance improved from 20s to 11s for Q12
> c) increased PARALLEL_TUPLE_QUEUE_SIZE to 6553600
> Q12 shows improvement in performance from 20s to 7s
>
> Case 3: patch applied = faster_gather_v3 as posted at [1]
> Q12 shows improvement in performance from 20s to 8s

I think that we need a little bit deeper analysis here to draw any
firm conclusions. My own testing showed about a 2x performance
improvement with all 4 patches applied on a query that did a Gather
Merge with many rows. Now, your testing shows the patches aren't
helping at all. But what accounts for the difference between your
results and mine? Without some analysis of that question, this is just
a data point that probably doesn't get us very far.

I suspect that one factor is that many of the queries actually send
very few rows through the Gather. You didn't send EXPLAIN ANALYZE
outputs for these runs, but I went back and looked at some old tests I
did on a small scale factor and found that, on those tests, Q2, Q6,
Q13, Q14, and Q15 didn't use parallelism at all, while Q1, Q4, Q5, Q7,
Q8, Q9, Q11, Q12, Q19, and Q22 used parallelism, but sent less than
100 rows through Gather. Obviously, speeding up Gather isn't going to
help at all when only a tiny number of rows are being sent through it.
The remaining seven queries sent the following numbers of rows through
Gather:

3: -> Gather Merge (cost=708490.45..1110533.81 rows=3175456 width=0) (actual time=21932.675..22150.587 rows=118733 loops=1)

10: -> Gather Merge (cost=441168.55..513519.51 rows=574284 width=0) (actual time=15281.370..16319.895 rows=485230 loops=1)

16: -> Gather (cost=1000.00..47363.41 rows=297653 width=40) (actual time=0.414..272.808 rows=297261 loops=1)

17: -> Gather (cost=1000.00..12815.71 rows=2176 width=4) (actual time=2.105..152.966 rows=1943 loops=1)
17: -> Gather Merge (cost=2089193.30..3111740.98 rows=7445304 width=0) (actual time=14071.064..33135.996 rows=9973440 loops=1)

18: -> Gather Merge (cost=3271973.63..7013135.71 rows=29992968 width=0) (actual time=81581.450..81581.594 rows=112 loops=1)

20: -> Gather (cost=1000.00..13368.31 rows=20202 width=4) (actual time=0.361..19.035 rows=21761 loops=1)

21: -> Gather (cost=1024178.86..1024179.27 rows=4 width=34) (actual time=12367.266..12377.991 rows=17176 loops=1)

Of those, Q18 is probably uninteresting because it only sends 112
rows, and Q20 and Q16 are probably uninteresting because the Gather
only executed for 19 ms and 272 ms respectively. Q21 doesn't look
interesting because it ran for about 12.4 seconds (12377.991 ms) and
only sent 17176 rows - so the bottleneck is probably generating the
tuples, not
sending them. The places where you'd expect the patch set to help are
where a lot of rows are being sent through the Gather or Gather Merge
node very quickly - so with these plans, Q17 probably has the best
chance of going faster with these patches, and maybe Q3 might benefit
a bit.

Now obviously your plans are different -- otherwise you couldn't be
seeing a speedup on Q12. So you have to look at the plans and try to
understand what the big picture is here. Spending a lot of time
running queries where the time taken by Gather is not the bottleneck
is not a good way to figure out whether we've successfully sped up
Gather. What would be more useful? How about:

- Once you've identified the queries where Gather seems like it might
be a bottleneck, run perf without the patch set and see whether Gather
or shm_mq related functions show up high in the profile. If they do,
run perf with the patch set and see if they become less prominent.

- Try running the test cases that Andres and I tried with and without
the patch set. See if it helps on those queries. That will help
verify that your testing procedure is correct, and might also reveal
differences in the effectiveness of that patch set on different
hardware. You could try this experiment on both PPC and x64, or on
both Linux and MacOS, to see whether CPU architecture and/or operating
system plays a role in the effectiveness of the patch.

I think it's a valid finding that increasing the size of the tuple
queue makes Q12 run faster, but I think that's not because it makes
Gather itself any faster. Rather, it's because there are fewer
pipeline stalls. With Gather Merge, whenever a tuple queue becomes
empty, the leader becomes unable to return any more tuples until the
process whose queue is empty generates at least one new tuple. If
there are multiple workers with non-full queues at the same time then
they can all work on generating tuples in parallel, but if every queue
except one is full, and that queue is empty, then there's nothing to
do but wait for that process. I suspect that is fairly common with
the plan you're getting for Q12, which I think looks like this:

Limit
-> GroupAggregate
   -> Gather Merge
      -> Nested Loop
         -> Parallel Index Scan
         -> Index Scan

Imagine that you have two workers, and say one of them starts up
slightly faster than the other. So it fills up its tuple queue with
tuples by executing the nested loop. Then the queue is full, so it
sleeps. Now the other worker does the same thing. Ignoring the
leader for the moment, what will happen next is that all of the tuples
produced by worker #1 are smaller than all of the tuples from worker #2,
so the gather merge will read and return all of the tuples from the
first worker while reading only a single tuple from the second one.
Then it reverses - we read one more tuple from the first worker while
reading and returning all the tuples from the second one. We're not
reading from the queues evenly, in a way that would keep all the
workers busy; instead, we're reading long runs of tuples from the same
worker while everybody else waits. Therefore, we're not really getting any
parallelism at all - for the most part, only one worker runs at a
time. Here's a fragment of EXPLAIN ANALYZE output from one of your
old emails on this topic[1]:

-> Gather Merge (cost=1001.19..2721491.60 rows=592261 width=27) (actual time=7.806..44794.334 rows=311095 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   -> Nested Loop (cost=1.13..2649947.55 rows=148065 width=27) (actual time=0.342..9071.892 rows=62257 loops=5)

You can see that we've got 5 participants here (leader + 4 workers).
Each one spends an average of 9.07 seconds executing the nested loop,
but they take 44.8 seconds to finish the whole thing. If they ran
completely sequentially it would have taken 45.4 seconds - so there
was only 0.6 seconds of overlapping execution. If we crank up the
queue size, we will eventually get it large enough that all of the
workers can run the plan to completion without filling up the queue, and
then things will indeed get much faster, but again, not because Gather
is any faster, just because then all workers will be running at the
same time.
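
To make that concrete, here is a toy, self-contained C simulation (just
a sketch I put together for illustration - the tuple counts, the queue
capacity, and the assumption that the leader consumes tuples for free
are all made up, and it models only two workers whose key ranges don't
overlap). The point it demonstrates is that the amount of overlapped
work is bounded by the queue capacity:

#include <stdio.h>

int main(void)
{
    const long N = 1000000;     /* tuples each worker produces (made-up number) */
    const long Q = 1000;        /* per-worker queue capacity in tuples (made-up) */

    /*
     * Worker 1 produces keys 0..N-1 and worker 2 produces keys N..2N-1, so
     * the merge emits everything from worker 1 before it consumes more than
     * the single comparison tuple from worker 2.  One tick = time for a
     * worker to produce one tuple; consuming is treated as free.
     */
    long q1 = 0, q2 = 0;                 /* tuples currently queued */
    long produced1 = 0, produced2 = 0;
    long emitted1 = 0, emitted2 = 0;
    long ticks = 0;

    while (emitted1 < N || emitted2 < N)
    {
        /* Each worker with remaining tuples and queue space produces one. */
        if (produced1 < N && q1 < Q) { produced1++; q1++; }
        if (produced2 < N && q2 < Q) { produced2++; q2++; }
        ticks++;

        /*
         * The leader can emit only while it holds a head tuple from every
         * unfinished worker; with disjoint key ranges it drains worker 1
         * completely, then worker 2.
         */
        while (q1 > 0 && (q2 > 0 || produced2 == N))
        {
            q1--;
            emitted1++;
        }
        if (emitted1 == N)
        {
            emitted2 += q2;
            q2 = 0;
        }
    }

    printf("elapsed: %ld ticks, ideal with full overlap: %ld, overlapped work: %ld\n",
           ticks, N, 2 * N - ticks);
    return 0;
}

With those made-up numbers it reports 1999000 ticks against an ideal of
1000000 - only 1000 ticks of overlap, i.e. exactly the queue capacity -
while setting Q >= N collapses it to the ideal.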

In some sense, that's OK: a speedup is a speedup. However, to get the
maximum speedup with this sort of plan, the queue needs to be big enough
that it never fills up. How big that is depends on the data set size. If we
make the queue 100x bigger based on these test results, and then you
test on a data set that is 10x bigger, you'll come back and recommend
again making it 10x bigger, because it will again produce a huge
performance gain. On the other hand, if you test a data set that's
only 2x bigger, you'll come back and recommend making the queue 2x
bigger, because that will be good enough. If you test a data set
that's only half as big as this one, you'll probably find that you
don't need to enlarge the queue 100x -- 50x will be good enough.
There is no size that we can make the queue that will be good enough
in general: somebody can always pick a data set large enough that the
queues fill up, and after that only one worker will run at a time on
this plan-shape. Contrariwise, somebody can always pick a small
enough data set that a given queue size just wastes memory without
helping performance.
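
Just to illustrate how data-dependent "big enough" is, here's a quick
back-of-the-envelope sketch using the plan fragment quoted above
(roughly 62257 rows per participant at width=27); the 32 bytes of
per-tuple queue overhead is a guess on my part, not a measured number:

#include <stdio.h>

int main(void)
{
    /* Numbers from the Q12 plan fragment above; overhead is assumed. */
    const long rows_per_worker = 62257;
    const long payload_bytes = 27;
    const long assumed_overhead_bytes = 32;

    long per_queue = rows_per_worker * (payload_bytes + assumed_overhead_bytes);

    printf("queue space needed so it never fills up: ~%.1f MB per worker\n",
           per_queue / (1024.0 * 1024.0));
    return 0;
}

That comes out to roughly 3.5 MB per queue for that particular plan and
scale factor, and the requirement scales directly with the number of
rows each worker has to push through its queue.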

Similarly, I think that faster_gather_v3.patch is effective here
because it lets all the workers run at the same time, not because
Gather gets any faster. The local queue is 100x bigger than the
shared queue, and that's big enough that the workers never have to
block, so they all run at the same time and things are great. I don't
see much advantage in pursuing this route. For the local queue to
make sense it needs to have some advantage that we can't get by just
making the shared queue bigger, which is easier and less code. The
original idea was that we'd reduce latch traffic and spinlock
contention by moving data from the local queue to the shared queue in
bulk, but the patches I posted attack those problems more directly.

As a general point, I think we need to separate the two goals of (1)
making Gather/Gather Merge faster and (2) reducing Gather Merge
related pipeline stalls. The patches I posted do (1). With respect
to (2), I can think of three possible approaches:

1. Make the tuple queues bigger, at least for Gather Merge. We can't
fix the problem that the data might be too big to let all workers run
to completion before blocking, but we could make it less likely by
allowing for more space, scaled by work_mem or some new GUC.

2. Have the planner figure out that this is going to be a problem. I
kind of wonder how often it really makes sense to feed a Gather Merge
from a Parallel Index Scan, even indirectly. I wonder if this would
run faster if it didn't use parallelism at all. If there are enough
intermediate steps between the Parallel Index Scan and the Gather
Merge, then the Gather Merge strategy probably makes sense, but in
general it seems pretty sketchy to break the ordered stream of data
that results from an index scan across many processes and then almost
immediately try to reassemble that stream into sorted order. That's
kind of lame.

3. Have Parallel Index Scan do a better job distributing the tuples
randomly across the workers. The problem here happens because, if we
sat and watched which worker produced the next tuple, it wouldn't look
like 1,2,3,4,1,2,3,4,... but rather
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,(many more times),1,1,1,2,2,2,2,2,....
If we could somehow scramble the distribution of tuples to workers so
that this didn't happen, I think it would fix this problem.

Neither (2) nor (3) seem terribly easy to implement so maybe we should
just go with (1), but I feel like that's not a very deep solution to
the problem.
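
Just to be concrete about what (1) could look like, here's a rough
sketch - the helper name and the work_mem-based formula are
placeholders I made up, not a worked-out patch, and it assumes the
usual postgres.h / GUC context:

/*
 * Hypothetical sketch only.  Today the executor uses a fixed 64kB
 * PARALLEL_TUPLE_QUEUE_SIZE per worker; the idea in (1) is to let
 * Gather Merge ask for more, scaled by work_mem (which is in kB) or
 * by some new GUC.
 */
#define PARALLEL_TUPLE_QUEUE_SIZE 65536

static Size
choose_tuple_queue_size(bool for_gather_merge)
{
    Size        size = PARALLEL_TUPLE_QUEUE_SIZE;

    if (for_gather_merge)
        size = Max(size, (Size) work_mem * 1024L);

    return size;
}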

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] https://www.postgresql.org/message-id/CAOGQiiOAhNPB7Ow8E4r3dAcLB8LEy_t_oznGeB8B2yQbsj7BFA%40mail.gmail.com
