Re: Parallel Hash take II

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Prabhat Sahu <prabhat(dot)sahu(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Oleg Golovanov <rentech(at)mail(dot)ru>
Subject: Re: Parallel Hash take II
Date: 2017-09-18 20:06:51
Message-ID: CA+TgmoY7pqs6KfArrH3WdidPr9KP43cyaseMt9kaNAYfhE3xgQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 14, 2017 at 10:01 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> 3. Gather Merge and Parallel Hash Join may have a deadlock problem.
> Gather Merge needs to block waiting for tuples, but workers wait
> for all participants (including the leader) to reach barriers. TPCH
> Q18 (with a certain set of indexes and settings, YMMV) has Gather
> Merge over Sort over Parallel Hash Join, and although it usually runs
> successfully I have observed one deadlock. Ouch. This seems to be a
> more fundamental problem than the blocked TupleQueue scenario. Not
> sure what to do about that.

Thomas and I spent about an hour and a half brainstorming about this
just now. Parallel query doesn't really have a documented deadlock
avoidance strategy, yet all committed and proposed patches other than
this one manage to avoid deadlock. This one has had a number of
problems crop up in this area, so it struck me that it might be
violating a rule that every other patch was following. After
struggling for a bit, I managed to articulate the deadlock-avoidance
rule that I think the other committed and proposed patches generally
follow:

<rule>
Once you enter a state in which other participants might wait for you,
you must exit that state before doing anything that might wait for
another participant.
</rule>

From this, it's easy to see that the waits-for graph can't contain any
cycles: if every parallel query node obeys the above rule, then a
given node can have in-arcs or out-arcs, but not both. I also believe
it to be the case that every existing node follows this rule. For
instance, Gather and Gather Merge wait for workers, but they aren't at
that point doing anything that can make the workers wait for them.
Parallel Bitmap Heap Scan waits for the leader to finish building the
bitmap, but that leader never waits for anyone else while building the
bitmap. Parallel Index(-Only) Scan waits for the process advancing
the scan to reach the next page, but that process never waits for any
other while so doing. Other types of parallel nodes -- including the
proposed Parallel Append node, which is an interesting case because
like Parallel Hash it appears in the "middle" of the parallel portion
of the plan tree rather than the root like Gather or the leaves like a
parallel scan -- don't wait at all, except for short
spinlock-protected or LWLock-protected critical sections during which
they surely don't go into any sort of long-term wait (which would be
unacceptable for other reasons anyway).
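
To make the acyclicity argument concrete, here is a tiny standalone C
sketch (purely illustrative, not PostgreSQL code; all names are
invented) that scans a waits-for edge list for a participant with both
in-arcs and out-arcs -- the thing the rule forbids, and the only way a
cycle could arise:

#include <stdbool.h>
#include <stdio.h>

#define MAX_PARTICIPANTS 8

typedef struct WaitEdge
{
    int     waiter;     /* participant doing the waiting */
    int     waitee;     /* participant being waited for */
} WaitEdge;

/*
 * Return true if some participant both waits and is waited for.  Under
 * the rule above that never happens, and without such a participant the
 * waits-for graph cannot contain a cycle.
 */
static bool
violates_rule(const WaitEdge *edges, int nedges)
{
    bool    has_out[MAX_PARTICIPANTS] = {false};
    bool    has_in[MAX_PARTICIPANTS] = {false};

    for (int i = 0; i < nedges; i++)
    {
        has_out[edges[i].waiter] = true;
        has_in[edges[i].waitee] = true;
    }
    for (int p = 0; p < MAX_PARTICIPANTS; p++)
    {
        if (has_out[p] && has_in[p])
            return true;
    }
    return false;
}

int
main(void)
{
    /* leader (0) waits for workers 1 and 2: no cycle is possible */
    WaitEdge    gather[] = {{0, 1}, {0, 2}};
    /* worker 1 also waits for the leader at a barrier: rule violated */
    WaitEdge    deadlock[] = {{0, 1}, {1, 0}};

    printf("gather: %s\n", violates_rule(gather, 2) ? "violates" : "ok");
    printf("deadlock: %s\n", violates_rule(deadlock, 2) ? "violates" : "ok");
    return 0;
}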

Parallel hash violates this rule only in the case of a multi-batch
hash join, and for only one reason: to avoid blowing out work_mem.
Since, consistent with resource management decisions elsewhere, each
participant is entitled to an amount of memory equal to work_mem, the
shared hash table can and does use up to (participants * work_mem),
which means that we must wait for everybody to be done with the hash
table for batch N before building the hash table for batch N+1. More
properly, if the hash table for the current batch happens to be
smaller than the absolute maximum amount of memory we can use, we can
build the hash table for the next batch up to the point where all the
memory is used, but must then pause and wait for the old hash table to
go away before continuing. But that means that the process for which
we are waiting violated the rule mentioned above: by not being done
with the memory, it's making other processes wait, and by returning a
tuple, it's allowing other parts of the executor to do arbitrary
computations which can themselves wait. So, kaboom.
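
To put rough numbers on that, here is a trivial sketch of the
accounting (the figures are invented for illustration, not taken from
the patch):

#include <stdio.h>

#define MB (1024L * 1024L)

int
main(void)
{
    /* hypothetical settings, not from the patch */
    int     participants = 4;
    long    work_mem = 64 * MB;
    long    budget = participants * work_mem;   /* 256MB shared budget */
    long    batch_n = 200 * MB;     /* hash table for the current batch */

    /*
     * Batch N+1 can be built only in the remaining headroom; beyond that
     * we must wait for every participant to be done with batch N, even
     * though some of them may already be blocked higher up in the plan.
     */
    printf("headroom for batch N+1 before we must wait: %ldMB\n",
           (budget - batch_n) / MB);
    return 0;
}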

One simple and stupid way to avoid this deadlock is to reduce the
memory budget for the shared hash table to work_mem and remove the
barriers that prevent more than one such hash table from existing at a
time. In the worst case, we still use (participants * work_mem), and
frequently we'll use less, but there are no longer any waits for
processes that might not even be running the parallel hash node
(ignoring for the moment the problem of right and full parallel hash
joins, which might need more thought). So no deadlock.
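
In code terms the difference is just the limit enforced on each shared
hash table; a sketch with invented names, not the patch's actual
variables:

#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sizing helper, for illustration only.  The "simple and
 * stupid" variant caps each batch's shared hash table at plain work_mem
 * and lets tables for different batches coexist, so no participant ever
 * waits for another one to release memory.
 */
size_t
shared_hash_limit(bool simple_variant, int participants, size_t work_mem)
{
    if (simple_variant)
        return work_mem;        /* per batch; no cross-batch barrier */
    return (size_t) participants * work_mem;    /* needs the barrier */
}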

We can do better. First, as long as nbatches == 1, we can use a hash
table of up to size (participants * work_mem); if we have to switch to
multiple batches, then just increase the number of batches enough that
the current memory usage drops below work_mem. Second, following an
idea originally by Ashutosh Bapat whose relevance to this issue Thomas
Munro realized during our discussion, we can make all the batches
small enough to fit in work_mem (rather than participants * work_mem
as the current patch does) and spread them across the workers (in the
style of Parallel Append, including potentially deploying multiple
workers against the same batch if there are fewer batches than
workers). Then, single-batch parallel hash joins use the maximum
allowable memory always, and multi-batch parallel hash joins use the
maximum allowable memory after the first batch. Not perfect, but not
bad, and definitely better than deadlocking. Further refinements
might be possible.
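
Here is a sketch of how that sizing rule might look (invented names and
structure, not the patch's implementation):

#include <stddef.h>

/*
 * Illustrative only.  While the inner relation fits in the full shared
 * budget we stay single-batch; otherwise keep doubling the batch count
 * until each batch's hash table fits under plain work_mem, so batches
 * can be spread across workers Parallel Append-style without anyone
 * having to wait for another batch's table to be freed.
 */
int
choose_nbatch(size_t inner_size, size_t work_mem, int participants)
{
    int     nbatch = 1;

    if (inner_size <= (size_t) participants * work_mem)
        return 1;               /* single batch: use the whole shared budget */

    while (inner_size / nbatch > work_mem)
        nbatch *= 2;            /* shrink per-batch size below work_mem */

    return nbatch;
}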

If we don't adopt some approach along these lines, then I think we've
got to articulate some alternative deadlock-avoidance rule and make
sure every parallel query facility follows it. I welcome ideas on
that front, but I don't think the rule mentioned above is a bad one,
and I'd obviously like to minimize the amount of rework that we need
to do. Assuming we do settle on the above rule, it clearly needs to
be documented someplace -- not sure of the place. I think that it
doesn't belong in README.parallel because it's an executor-specific
rule, not necessarily a general rule to which other users of
parallelism must adhere; they can choose their own strategies.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
