From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: [[Parallel] Shared] Hash
On Thu, Feb 2, 2017 at 4:57 PM, Rafia Sabih wrote:
> On Thu, Feb 2, 2017 at 1:19 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> On Thu, Feb 2, 2017 at 3:34 AM, Rafia Sabih
>> <rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
>>> [ regressions ]
>> Thanks Rafia. At first glance this plan is using the Parallel Shared
>> Hash in one place where it should pay off, that is loading the orders
>> table, but the numbers are terrible. I noticed that it uses batch
>> files and then has to increase the number of batch files, generating a
>> bunch of extra work, even though it apparently overestimated the
>> number of rows, though that's only ~9 seconds of ~60. I am [...]
> Hi Thomas,
> Apart from the previously reported regression, there appears to be one
> more issue in this set of patches. At times, when running a query using
> parallel hash, it hangs up and all the workers, including the master,
> show the following backtrace:
Here's a new version to fix the problems reported by Rafia above. The
patch descriptions are as before but it starts from 0002 because 0001
was committed as 7c5d8c16 (thanks, Andres).
First, some quick master-vs-patch numbers from the queries listed with
regressions, using TPCH dbgen scale 10, work_mem = 64MB,
max_parallel_workers_per_gather = 4, shared_buffers = 8GB (these numbers
are not directly comparable to the earlier report, since the scale and
hardware differ).
Better except for Q5 and Q8, which for some mysterious reason plans
only one worker and then loses. I'm looking into that.
Q3 19917.682 -> 8649.822
Q5 4149.983 -> 4192.551
Q7 14453.721 -> 10303.911
Q8 1981.540 -> 8030.264
Q9 26928.102 -> 17384.607
Q10 16955.240 -> 14563.787
I plan to explore the performance space with a range of worker numbers
and work_mem sizes and do some analysis; more soon.
1. Fixed two bugs that resulted in ExecHashShrink sometimes hanging,
as reported by Rafia. (1) When splitting the large v3 patch up into
smaller patches for v4, I'd managed to lose the line that initialises
shared->shrink_barrier, causing some occasional strange behaviour.
(2) I found a bug in condition_variable.c that could cause hangs,
which I fixed via a separate patch; the fix was committed as
3f3d60d3 (thanks, Robert).
2. Simplified barrier.c by removing BarrierWaitSet(), because that
turned out to be unnecessary to implement rescan as I'd originally
thought, and was incompatible with the way BarrierDetach() works. The
latter assumes that the phase only ever increments, so that
combination of features was broken.
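To illustrate the constraint, here is a hypothetical, single-process model of
the phase bookkeeping (invented names and fields, not the real barrier.c API):
because the phase only ever increments, a detaching participant can rely on
any phase it has observed never recurring, which a BarrierWaitSet() that can
move the phase arbitrarily would break.

```c
#include <assert.h>

/* Simplified, single-process model of barrier bookkeeping (illustrative
 * only): the phase is a monotonically increasing counter. */
typedef struct Barrier {
    int participants;   /* currently attached */
    int arrived;        /* arrived at the current phase */
    int phase;          /* only ever increments */
} Barrier;

static int barrier_attach(Barrier *b) {
    b->participants++;
    return b->phase;            /* caller remembers the phase it joined at */
}

/* Returns 1 if this caller was the last to arrive and advanced the phase. */
static int barrier_arrive(Barrier *b) {
    if (++b->arrived == b->participants) {
        b->arrived = 0;
        b->phase++;             /* advance; never set backwards */
        return 1;
    }
    return 0;
}

static void barrier_detach(Barrier *b) {
    /* If the detaching participant was the only one not yet arrived, the
     * remaining participants' arrivals now complete the phase. */
    b->participants--;
    if (b->participants > 0 && b->arrived == b->participants) {
        b->arrived = 0;
        b->phase++;
    }
}
```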
3. Sorted out the hash table sizing logic that was previously leading
to some strange decisions about batches. This involved putting the
total estimated number of inner rows into the path and plan when there
is a partial inner plan, because plan_rows only has the partial
number. I need to size the hash table correctly at execution time.
It seems a bit strange to do that specifically and only for Hash (see
rows_total in the 0008 patch)... should there be some more generic
way? Should total rows go into Plan rather than HashPlan, or perhaps
the parallel divisor should go somewhere?
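To make the sizing issue concrete, here is a minimal sketch with made-up
names (total_rows_from_partial and choose_nbatch are not real planner
functions, and the divisor here is a simplification of what the planner
actually does): plan_rows on a partial path holds only one worker's share,
so the executor needs a total estimate to pick a sensible number of batches.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: recover a total row estimate from a partial one by
 * assuming the leader counts as one full participant.  The real planner's
 * divisor is more subtle about the leader's contribution. */
static double total_rows_from_partial(double plan_rows, int parallel_workers)
{
    return plan_rows * (parallel_workers + 1);
}

/* Pick the number of batches: the smallest power of two such that each
 * batch's share of the inner side is estimated to fit in work_mem. */
static int choose_nbatch(double rows_total, size_t tuple_width,
                         size_t work_mem_bytes)
{
    double inner_bytes = rows_total * (double) tuple_width;
    int nbatch = 1;
    while ((double) work_mem_bytes * nbatch < inner_bytes)
        nbatch <<= 1;
    return nbatch;
}
```

Sizing from plan_rows alone would underestimate the inner side by the whole
divisor, which is one way to end up with too few batches and a later
"ugly"-mode split.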
4. Comments fixed and added based on Ashutosh's feedback on patch 0003.
5. Various small bug fixes.
I've also attached a small set of test queries that hit the four
"modes" (for want of a better word) of our hash join algorithm for
dealing with different memory conditions, which I've nicknamed thus:
1. "Good": We estimate that the hash table will fit in work_mem, and
at execution time it does. This patch makes that more likely because
[Parallel] Shared Hash gets to use more work_mem as discussed.
2. "Bad": We estimate that the hash table won't fit in work_mem, but
that if we partition it into N batches using some bits from the hash
value then each batch will fit in work_mem. At execution time, each
batch does indeed fit into work_mem. This is not ideal, because we
have to write out and read back in (N - 1) / N of the inner and outer
tuples (ie all batches except the first one, although actually
costsize.c always charges for all of them). But it may still be better than
other plans, and the IO is sequential. Currently Shared Hash
shouldn't be selected over (private) Hash if it would require batching
anyway due to the cpu_shared_tuple_cost tie-breaker: on the one hand it
avoids a bunch of copies of the batch files being written out, but on
the other it introduces a bunch of synchronisation overhead. Parallel
Shared Hash is fairly likely to be chosen if possible, due to the
division of the inner relation's cost outweighing the
cpu_shared_tuple_cost penalty.
3. "Ugly": We planned for "good" or "bad" mode, but we ran out of
work_mem at some point during execution: this could be during the
initial hash table load, or while loading a subsequent batch. So now
we double the number of batches, splitting the current batch and all
batches that haven't been processed yet into two in the hope of
shrinking the hash table, while generating extra reading and writing
of all as-yet unprocessed tuples. This patch can do the shrinking
work in parallel, which may help.
4. "Fail": After reaching "ugly" mode (and perhaps trying multiple
times to shrink the hash table), we deduce that there is a kind of
extreme skew that our partitioning scheme can never help with. So we
stop respecting work_mem and hope for the best. The hash join may or
may not be able to complete, depending on how much memory you can
successfully allocate without melting the server or being killed by
the OOM reaper.
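The batch-splitting in "ugly" mode works because the batch number is taken
from the low bits of the hash value, so doubling the batch count consults
exactly one additional hash bit: every tuple either stays in its current
batch or moves to one predictable new batch. A tiny sketch (illustrative
names, not the real ExecHashIncreaseNumBatches code):

```c
#include <assert.h>
#include <stdint.h>

/* With nbatch a power of two, the batch number is the low bits of the
 * hash value. */
static int batch_for(uint32_t hashvalue, int nbatch)
{
    return (int) (hashvalue & (uint32_t) (nbatch - 1));
}

/* After doubling nbatch, a tuple stays in its current batch iff the next
 * hash bit is zero; otherwise it moves to batch (old_batchno + old_nbatch). */
static int stays_after_doubling(uint32_t hashvalue, int old_nbatch)
{
    return batch_for(hashvalue, old_nbatch * 2) ==
           batch_for(hashvalue, old_nbatch);
}
```

This is also why extreme skew defeats the scheme: if many tuples share one
hash value, they share every hash bit, and no amount of doubling separates
them.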
The "ugly" mode was added in 2005, so before that we had only
"good", "bad" and "fail". We don't ever want to be in "ugly" or
"fail" modes: a sort merge join would have been better, or in any
case is guaranteed to be able to run to completion in the configured
space. However, at the point where we reach this condition, there
isn't anything else we can do.
Some other interesting cases that hit new code are: rescan with single
batch (reuses the hash table contents), rescan with multiple batches
(blows away and rebuilds the hash table), outer join (scans hash table
for unmatched tuples). Outer joins are obviously easy to test but
rescans are a bit tricky to reach... one way is to run TPCH Q9 with
cpu_shared_tuple_cost = -10 (I think what's happening here is that
it's essentially running the optimiser in reverse, and a nested loop
rescanning a gather node (= fork/exit workers for every loop) is about
the worst plan imaginable), but I haven't found a short and sweet test
query for that yet.
Some assorted thoughts:
* Instead of abandoning our work_mem limit in "fail" mode, you might
think we could probe the portion of the hash table that we managed to
load so far, then rewind the outer batch and probe again using the
next work_mem-sized portion of the same inner batch file. This
doesn't work though because in the case of work_mem exhaustion during
the initial batch, it's too late to decide to start recording the
initial outer batch, so we have no way to rewind.
* Instead of using the shared hash table for batch mode, we could do
just the initial batch with a shared hash table, but drop back to
smaller private hash tables for later batches and give each worker its
own batch to work on until they're all done, with no further
communication. There are some problems with this though: inability to
handle outer joins (just like parallel hash join in 9.6), limit of
work_mem (not work_mem * P) for the private hash tables, load
balancing/granularity problems with skewed data. Thanks to my
colleague Ashutosh Bapat for this off-list suggestion.
One of the unpleasant things about this patch is the risk of deadlock,
as already discussed. I wanted to mention an idea for how to get rid
of this problem eventually. I am aware of two ways that a deadlock
could occur:
1. A worker is waiting to write into its tuple queue (because the
reader is not consuming fast enough and its fixed buffer has filled
up), but the leader (which should be reading the tuple queue) is stuck
waiting for the worker. This is avoided currently with the early-exit
protocol, at the cost of losing a CPU core after probing the first
batch.
2. Two different hash joins run in non-deterministic order. Workers
A and B have executed hash join nodes 1 and 2 at least once and
attached to the barrier, and now Worker A is in hash join node 1, and
worker B is in hash join node 2 at a barrier wait point. I am not
aware of any executor nodes that could do that currently, but there is
nothing to say that future nodes couldn't do that. If I am wrong
about that and this could happen today, that would be fatal for this
patch in its current form.
Once we have asynchronous execution infrastructure, perhaps we could
make those problems go away like this:
1. Introduce a new way for barrier clients to try to advance to the
next phase, but detach and return immediately if they would have to
wait.
2. Introduce a way for barriers to participate in the readiness
protocol used for async execution, so that a barrier advance counts as
a kind of readiness. (The asynchronous scheduler probably doesn't
need to know anything about that since it's based on latches which the
WaitSet API already knows how to multiplex.)
3. Teach Hash Join to yield instead of waiting at barriers, asking to
be executed again when the barrier might have advanced.
4. Make sure the Gather node is suitably asynchronicity-aware. At a
minimum it should be able to deal with the child plan yielding (in the
case where it runs in the leader due to lack of better things to do)
and be able to try that again when it needs to.
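Point 1 might look something like this hypothetical variant (single-process
model with invented names, not an actual barrier.c function): the caller
either advances the phase because it was the last participant, or detaches
rather than blocking, so no one ever waits inside the node.

```c
#include <assert.h>
#include <stdbool.h>

/* Single-process model, illustrative only. */
typedef struct Barrier {
    int participants;
    int phase;
} Barrier;

static void barrier_attach(Barrier *b)
{
    b->participants++;
}

/* Never blocks: returns true if we were the last participant and advanced
 * the phase (staying attached); false if we detached instead of waiting,
 * in which case the caller should yield and retry later. */
static bool barrier_arrive_or_detach(Barrier *b)
{
    if (b->participants == 1) {
        b->phase++;             /* last one here: advance */
        return true;
    }
    b->participants--;          /* would have to wait: detach instead */
    return false;
}
```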