Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, sbjesse(at)gmail(dot)com
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2020-01-10 02:37:11
Message-ID: CAAKRu_YOcGjUzGHXUVw8oHHYdsCw9Z+WtTgqCVa7tdhg0HNrXg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 7, 2020 at 4:14 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> * I am uneasy about BarrierArriveExplicitAndWait() (a variant of
> BarrierArriveAndWait() that lets you skip directly to a given phase?);
> perhaps you only needed that for a circular phase system, which you
> could do with modular phase numbers, like PHJ_GROW_BATCHES_PHASE? I
> tried to make the barrier interfaces look like the libraries in other
> parallel programming environments, and I'd be worried that the
> explicit phase thing could easily lead to bugs.
>

So, I actually use it to circle back up to the first phase while
skipping the last phase, which is why I couldn't do it with modular
phase numbers and a loop. The last phase detaches from the chunk
barrier, and I don't want to detach from the chunk barrier if there
are more chunks to process.
Basically, I need a way to attach to the chunk barrier only at the
beginning of the first chunk and detach only at the end of the last
chunk -- not in between chunks. I will return from the function and
re-enter between chunks -- say, between chunk 2 and chunk 3 of 5.
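
In simplified pseudocode (just to show the shape of it, not the actual
patch code -- wait events, the real phase names, and the exact
arguments are elided), the flow per worker is roughly:

/*
 * Illustrative sketch only, not the actual patch code.
 */
if (current_chunk == 0)
    BarrierAttach(chunk_barrier);    /* attach before the first chunk */

/*
 * Probe the current chunk. We return from the node and re-enter here
 * repeatedly while emitting matched tuples.
 */

if (more_chunks_remain)
{
    /*
     * Wait for all workers to finish this chunk, then jump back to the
     * probing phase rather than advancing into the final phase (which
     * is where the detach happens).
     */
    BarrierArriveExplicitAndWait(chunk_barrier, PHJ_CHUNK_PROBING);
}
else
{
    /* last chunk: advance into the final phase and detach */
    BarrierArriveAndWait(chunk_barrier, 0 /* wait event elided */);
    BarrierDetach(chunk_barrier);
}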

However, could this be solved by having more than one chunk barrier?
A worker would attach to one chunk barrier, attach to the other one
when it moves to the next chunk, and then switch back again for the
chunk after that. That way it could detach and attach each time it
enters/leaves the function.
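
Very roughly, and ignoring for the moment whether this actually helps
with the waiting problem, I'm picturing something like this (made-up
names, sketch only):

/*
 * Two chunk barriers, alternated by chunk number, so a worker can
 * attach on entering the function and detach before leaving it.
 */
Barrier *cur = &batch->chunk_barriers[current_chunk % 2];

BarrierAttach(cur);
/* ... work on the current chunk ... */
BarrierDetach(cur);

/*
 * Return a tuple here; on re-entry, attach to whichever of the two
 * barriers corresponds to the chunk being worked on at that point.
 */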

> * I'm not sure it's OK to wait at the end of each loop, as described
> in the commit message:
>
> Workers probing a fallback batch will wait until all workers have
> finished probing before moving on so that an elected worker can read
> and combine the outer match status files into a single bitmap and use
> it to emit unmatched outer tuples after all chunks of the inner side
> have been processed.
>
> Maybe I misunderstood completely, but that seems to break the
> programming rule described in nodeHashjoin.c's comment beginning "To
> avoid deadlocks, ...". To recap: (1) When you emit a tuple, the
> program counter escapes to some other node, and maybe that other node
> waits for thee, (2) Maybe the leader is waiting for you but you're
> waiting for it to drain its queue so you can emit a tuple (I learned a
> proper name for this: "flow control deadlock"). That's why the
> current code only ever detaches (a non-waiting operation) after it's
> begun emitting tuples (that is, the probing phase). It just moves
> onto another batch. That's not a solution here: you can't simply move
> to another loop, loops are not independent of each other like batches.
> It's possible that barriers are not the right tool for this part of
> the problem, or that there is a way to use a barrier that you don't
> remain attached to while emitting, or that we should remove the
> deadlock risks another way entirely[1] but I'm not sure. Furthermore,
> the new code in ExecParallelHashJoinNewBatch() appears to break the
> rule even in the non-looping case (it calls BarrierArriveAndWait() in
> ExecParallelHashJoinNewBatch(), where the existing code just
> detaches).
>

Yeah, I think I'm totally breaking that rule.
Just to make sure I understand the way in which I'm breaking it:

In my patch, while attached to a chunk_barrier, worker1 emits a
matched tuple (control leaves the current node). Meanwhile, worker2
has finished probing the chunk and is waiting on the chunk_barrier
for worker1. How, though, could worker1 be waiting for worker2?

Is this only a problem when one of the barrier participants is the
leader and is reading from the tuple queue? (I'm thinking of the
tuple queue deadlock hazard example in the thread [1] you referred
to.) Basically, is my deadlock hazard a tuple queue deadlock hazard?

I thought maybe this could be a problem with nested HJ nodes, but I'm
not sure.

As I understand it, this isn't a problem on current master with batch
barriers: a worker can emit tuples while attached to a batch_barrier
because no other workers will wait on the batch barrier once they
have started probing.

I need to think more about the suggestions you provided in [1] about
nixing the tuple queue deadlock hazard.

However, hypothetically, if we decide we don't want to break the
no-emitting-tuples-while-attached-to-a-barrier rule, how can we still
allow workers to coordinate while probing chunks of the batch
sequentially (one chunk at a time)?

I can think of two options (both sound slow and bad):

Option 1:
Stash away the matched tuples in a tuplestore and emit them at the
end of the batch, incurring more writes (rough sketch below).

Option 2:
Degenerate to 1 worker for fallback batches

Any other ideas?
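
For what it's worth, here is roughly what I was picturing for Option 1
(untested sketch; the per-worker/per-batch setup and the real emit
path are elided):

/*
 * Untested sketch of Option 1. While probing, attached to the chunk
 * barrier, matches are buffered instead of emitted:
 */
Tuplestorestate *matches = tuplestore_begin_heap(false,  /* randomAccess */
                                                 false,  /* interXact */
                                                 work_mem);

/* for each matched tuple found while probing a chunk */
tuplestore_puttupleslot(matches, matched_slot);

/*
 * Then, after the last chunk, once the worker has detached from the
 * chunk barrier, drain the buffered matches, one per call to the node:
 */
while (tuplestore_gettupleslot(matches, true, false, result_slot))
{
    /* return result_slot to the caller here */
}
tuplestore_end(matches);

That keeps the chunk_barrier waits but means no worker is emitting
tuples while attached, at the price of spilling and re-reading every
matched tuple in the fallback batch.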

>
> > - Rename "chunk" (as in chunks of inner side) to something that is
> > not already used in the context of memory chunks and, more
> > importantly, SharedTuplestoreChunk
>
> +1. Fragments? Loops? Blocks (from
> https://en.wikipedia.org/wiki/Block_nested_loop, though, no, strike
> that, blocks are also super overloaded).
>

Hmmm. I think "loop" is kind of confusing. "Fragment" has potential.
I also thought of "piece" -- that is actually where I am leaning now.
What do you think?

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com

--
Melanie Plageman
