Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesse Zhang <sbjesse(at)gmail(dot)com>, dkimura(at)pivotal(dot)io, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, david(dot)g(dot)kimura(at)gmail(dot)com
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2020-05-09 01:58:10
Message-ID: CAAKRu_Zn91PvgyzD111i99bkZ==7zcBKP-+VvLD38aiT5kaSNg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 28, 2020 at 7:03 PM Melanie Plageman <melanieplageman(at)gmail(dot)com>
wrote:

>
> There is a deadlock hazard in parallel hashjoin (pointed out by Thomas
> Munro in the past). Workers attached to the stripe_barrier emit tuples
> and then wait on that barrier.
> I believe that that can be addressed starting with this
> relatively unoptimized solution:
> - after probing a stripe in a batch, a worker sets the status of that
> batch to "tentatively done" and saves the stripe_barrier phase
> - if that worker is not the only worker attached to that batch, it
> detaches from both stripe and batch barriers and moves on to other
> batches
> - if that worker is the only worker attached to the batch, it will
> proceed to load the next stripe of that batch, and, once it has
> finished loading, it will set the status of the batch back to "not
> done" for itself
> - when the other worker encounters that batch again, if the
> stripe_barrier phase has not moved forward, it will mark that batch as
> done for itself. if the stripe_barrier phase has moved forward, it can
> join in in probing this batch for the current stripe.
>

Just to follow-up on the stripe barrier deadlock, I've implemented a
solution and attached it.

There are three solutions I've thought about so far:

1) leaders don't participate in fallback batches
2) serial after stripe 0
no worker can join a batch after any worker has left and only one
worker can work on stripes after stripe 0
3) provisionally complete batches
After the end of stripe 0, all workers except the last worker
detach from the stripe barrier, mark the batch as provisionally
done, save the stripe barrier phase, and move on to another batch.
Later, when one of these workers returns to the batch, if it is
not already done, the worker checks to see if the phase of the
stripe barrier has advanced. If the phase has advanced, it means
that no one is waiting for that worker. The worker can join that
batch. If the phase hasn't advanced, the worker won't risk
deadlock and will simply mark the batch as done. The last worker
executes the normal path -- participating in each stripe.

I've attached a patch to implement solution 3
v7-0002-Provisionally-detach-unless-last-worker.patch

This isn't a very optimized version of this solution. It detaches from
the stripe barrier and closes the outer match status bitmap upon
provisional completion by a worker. However, I ran into some problems
keeping outer match status bitmaps open for multiple batches at a time.

I've also attached the original adaptive hashjoin patch with a couple
small tweaks (not quite meriting a patch version bump, but that seemed
like the easiest way).

--
Melanie Plageman

Attachment Content-Type Size
v7-0002-Provisionally-detach-unless-last-worker.patch text/x-patch 9.4 KB
v7-0001-Implement-Adaptive-Hashjoin.patch text/x-patch 131.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2020-05-09 03:26:22 Re: Back-patch is necessary? Re: Don't try fetching future segment of a TLI.
Previous Message Alvaro Herrera 2020-05-09 00:49:20 Re: stat() on Windows might cause error if target file is larger than 4GB