Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesse Zhang <sbjesse(at)gmail(dot)com>, dkimura(at)pivotal(dot)io, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, david(dot)g(dot)kimura(at)gmail(dot)com
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2020-06-23 22:24:05
Message-ID: 20200623222405.5ggfkwdwkm2cqitq@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 08, 2020 at 05:12:25PM -0700, Melanie Plageman wrote:
>On Wed, May 27, 2020 at 7:25 PM Melanie Plageman <melanieplageman(at)gmail(dot)com>
>wrote:
>
>> I've attached a rebased patch which includes the "provisionally detach"
>> deadlock hazard fix approach
>>
>
>Alas, the "provisional detach" logic proved incorrect (see last point in
>the list of changes included in the patch at bottom).
>
>
>> Also, we kept the batch 0 spilling patch David Kimura authored [1]
>> separate so it could be discussed separately because we still had some
>> questions.
>>
>
>The serial batch 0 spilling is in the attached patch. Parallel batch 0
>spilling is still in a separate batch that David Kimura is working on.
>
>I've attached a rebased and updated patch with a few fixes:
>
>- semi-join fallback works now
>- serial batch 0 spilling in main patch
>- added instrumentation for stripes to the parallel case
>- SharedBits uses same SharedFileset as SharedTuplestore
>- reverted the optimization to allow workers to re-attach to a batch and
> help out with stripes if they are sure they pose no deadlock risk
>
>For the last point, I discovered a pretty glaring problem with this
>optimization: I did not include the bitmap created by a worker while
>working on its first participating stripe in the final combined bitmap.
>I only was combining the last bitmap file each worker worked on.
>
>I had the workers make new bitmaps for each time that they attached to
>the batch and participated because having them keep an open file
>tracking information for a batch they are no longer attached to on the
>chance that they might return and work on that batch was a
>synchronization nightmare. It was difficult to figure out when to close
>the file if they never returned and hard to make sure that the combining
>worker is actually combining all the files from all participants who
>were ever active.
>
>I am sure I can hack around those, but I think we need a better solution
>overall. After reverting those changes, loading and probing of stripes
>after stripe 0 is serial. This is not only sub-optimal, it also means
>that all the synchronization variables and code complexity around
>coordinating work on fallback batches is practically wasted.
>So, they have to be able to collaborate on stripes after the first
>stripe. This version of the patch has correct results and no deadlock
>hazard, however, it lacks parallelism on stripes after stripe 0.
>I am looking for ideas on how to address the deadlock hazard more
>efficiently.
>
>The next big TODOs are:
>- come up with a better solution to the potential tuple emitting/barrier
> waiting deadlock issue
>- parallel batch 0 spilling complete
>

Hi Melanie,

I started looking at the patch to refresh my knowledge both of this
patch and parallel hash join, but I think it needs a rebase. The
changes in 7897e3bb90 apparently touched some of the code. I assume
you're working on a patch addressing the remaining TODOS, right?

I see you've switched to "stripe" naming - I find that a bit confusing,
because when I hear stripe I think about RAID, where it means pieces of
data interleaved and stored on different devices. But maybe that's just
me and it's a good name. Maybe it'd be better to keep the naming and
only tweak it at the end, not to disrupt reviews unnecessarily.

Now, a couple comments / questions about the code.

nodeHash.c
----------

1) MultiExecPrivateHash says this

/*
* Not subject to skew optimization, so either insert normally
* or save to batch file if it belongs to another stripe
*/

I wonder what it means to "belong to another stripe". I understand what
that means for batches, which are identified by batchno computed from
the hash value. But I thought "stripes" are just work_mem-sized pieces
of a batch, so I don't quite understand this. Especially when the code
does not actually check "which stripe" the row belongs to.

2) I find the fields hashloop_fallback rather confusing. We have one in
HashJoinTable (and it's array of BufFile items) and another one in
ParallelHashJoinBatch (this time just bool).

I think HashJoinTable should be renamed to hashloopBatchFile (similarly
to the other BufFile arrays). Although I'm not sure why we even need
this file, when we have innerBatchFile? BufFile(s) are not exactly free,
in fact it's one of the problems for hashjoins with many batches.

3) I'm a bit puzzled about this formula in ExecHashIncreaseNumBatches

childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;

and also about this comment

/*
* TODO: what to do about tuples that don't go to the child
* batch or stay in the current batch? (this is why we are
* counting tuples to child and curbatch with two diff
* variables in case the tuples go to a batch that isn't the
* child)
*/
if (batchno == childbatch)
childbatch_outgoing_tuples++;

I thought each old batch is split into two new ones, and the tuples
either stay in the current one, or are moved to the new one - which I
presume is the childbatch, although I haven't tried to decode that
formula. So where else could the tuple go, as the comment tried to
suggest?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2020-06-23 23:06:25 Re: Review for GetWALAvailability()
Previous Message Alvaro Herrera 2020-06-23 21:56:10 pg_bsd_indent compiles bytecode