Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesse Zhang <sbjesse(at)gmail(dot)com>, dkimura(at)pivotal(dot)io, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, david(dot)g(dot)kimura(at)gmail(dot)com
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2020-06-09 00:12:25
Message-ID: CAAKRu_YKsO6=GMN_6SMeJwuRXEbb1o2mtReHT-GULXt9mtnKYA@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 27, 2020 at 7:25 PM Melanie Plageman <melanieplageman(at)gmail(dot)com>
wrote:

> I've attached a rebased patch which includes the "provisionally detach"
> deadlock hazard fix approach
>

Alas, the "provisional detach" logic proved incorrect (see last point in
the list of changes included in the patch at bottom).

> Also, we kept the batch 0 spilling patch David Kimura authored [1]
> separate so it could be discussed separately because we still had some
> questions.
>

Serial batch 0 spilling is now in the attached patch. Parallel batch 0
spilling is still in a separate patch that David Kimura is working on.

I've attached a rebased and updated patch with a few fixes:

- semi-join fallback works now
- serial batch 0 spilling in main patch
- added instrumentation for stripes to the parallel case
- SharedBits uses the same SharedFileSet as the SharedTuplestore
- reverted the optimization to allow workers to re-attach to a batch and
help out with stripes if they are sure they pose no deadlock risk

For the last point, I discovered a pretty glaring problem with this
optimization: the bitmap a worker created while working on its first
participating stripe was never included in the final combined bitmap.
I was only combining the last bitmap file each worker worked on.

I had the workers make a new bitmap each time they attached to the
batch and participated, because having them keep a file open, tracking
information for a batch they were no longer attached to, on the chance
that they might return to work on that batch, was a synchronization
nightmare. It was difficult to figure out when to close the file if a
worker never returned, and hard to make sure that the combining worker
actually combined all the files from every participant that was ever
active.
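
To make that concrete, here is a minimal sketch of what the combining
step has to do: OR in every bitmap a participant produced (one per
attachment to the batch) rather than only the last one. This is not
the patch's SharedBits code; the names and the fixed-size layout are
made up for illustration.

#include <stddef.h>
#include <stdint.h>

#define BITMAP_WORDS	1024	/* hypothetical fixed-size outer-match bitmap */
#define MAX_ATTACHMENTS	8		/* hypothetical cap on re-attachments */

typedef struct WorkerBitmaps
{
	int			nbitmaps;	/* one bitmap per time this worker attached */
	uint64_t	bitmaps[MAX_ATTACHMENTS][BITMAP_WORDS];
} WorkerBitmaps;

/*
 * OR every bitmap from every participant into 'combined'.  The bug was
 * equivalent to looping only over bitmaps[nbitmaps - 1] here, dropping
 * the matches recorded during each worker's earlier attachments.
 */
void
combine_all_bitmaps(uint64_t combined[BITMAP_WORDS],
					const WorkerBitmaps *workers, int nworkers)
{
	for (int w = 0; w < nworkers; w++)
		for (int b = 0; b < workers[w].nbitmaps; b++)
			for (size_t i = 0; i < BITMAP_WORDS; i++)
				combined[i] |= workers[w].bitmaps[b][i];
}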

I am sure I could hack around those problems, but I think we need a
better solution overall. After reverting those changes, loading and
probing of stripes after stripe 0 are serial. That is not only
sub-optimal, it also means that all of the synchronization variables
and code complexity around coordinating work on fallback batches are
practically wasted, so workers have to be able to collaborate on
stripes after the first one. This version of the patch produces
correct results and has no deadlock hazard; however, it lacks
parallelism on stripes after stripe 0. I am looking for ideas on how
to address the deadlock hazard more efficiently.
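
For anyone who hasn't followed the earlier discussion, here is a toy
model of the hazard in plain pthreads (not PostgreSQL code, and it
hangs by design): the "leader" waits at a barrier before draining a
worker's bounded tuple queue, while the worker blocks on the full
queue before it can reach the same barrier, so neither makes progress.

#include <pthread.h>
#include <semaphore.h>

static pthread_barrier_t stripe_barrier;
static sem_t queue_slots;		/* bounded "tuple queue" with one free slot */

static void *
worker_main(void *arg)
{
	(void) arg;
	sem_wait(&queue_slots);		/* emit a tuple: the queue is now full */
	sem_wait(&queue_slots);		/* emit another: blocks on the full queue */
	pthread_barrier_wait(&stripe_barrier);	/* never reached */
	return NULL;
}

int
main(void)
{
	pthread_t	worker;

	pthread_barrier_init(&stripe_barrier, NULL, 2);
	sem_init(&queue_slots, 0, 1);
	pthread_create(&worker, NULL, worker_main, NULL);

	/* The leader waits for the stripe barrier before reading tuples... */
	pthread_barrier_wait(&stripe_barrier);	/* blocks forever */
	sem_post(&queue_slots);		/* ...so the queue is never drained */

	pthread_join(worker, NULL);
	return 0;
}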

The next big TODOs are:
- come up with a better solution to the potential tuple-emitting /
barrier-waiting deadlock issue
- complete parallel batch 0 spilling

--
Melanie Plageman

Attachment Content-Type Size
v9-0001-Implement-Adaptive-Hashjoin.patch text/x-patch 170.7 KB
