Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2019-05-20 01:25:52
Message-ID: CA+hUKG+2bygZcegbmKV2doaKyBbq8r3qTBbLK5aawZ_chbNbUg@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 20, 2019 at 12:22 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On Mon, May 20, 2019 at 11:07:03AM +1200, Thomas Munro wrote:
> >First let me restate the PostgreSQL terminology for this stuff so I
> >don't get confused while talking about it:
> >
> >* The inner side of the join = the right side = the side we use to
> >build a hash table. Right and full joins emit inner tuples when there
> >is no matching tuple on the outer side.
> >
> >* The outer side of the join = the left side = the side we use to
> >probe the hash table. Left and full joins emit outer tuples when
> >there is no matching tuple on the inner side.
> >
> >* Semi and anti joins emit exactly one instance of each outer tuple if
> >there is/isn't at least one match on the inner side.
> >
>
> I think you're conflating inner/outer side and left/right, or rather
> assuming it's always left=inner and right=outer.

In PostgreSQL, it's always inner = right, outer = left. You can see
that reflected in plannodes.h and elsewhere:

/* ----------------
* these are defined to avoid confusion problems with "left"
* and "right" and "inner" and "outer". The convention is that
* the "left" plan is the "outer" plan and the "right" plan is
* the inner plan, but these make the code more readable.
* ----------------
*/
#define innerPlan(node) (((Plan *)(node))->righttree)
#define outerPlan(node) (((Plan *)(node))->lefttree)

I'm not sure why you think it's not always like that: are you referring to
the fact that the planner can choose to reverse the join (compared to
the SQL LEFT|RIGHT JOIN that appeared in the query), creating an extra
layer of confusion? In my email I was talking only about left and
right as seen by the executor.

> >About the question of when exactly to set the "use_NLJ" flag: I had
> >originally been thinking of this only as a way to deal with the
> >extreme skew problem. But in light of Tomas's complaints about
> >unmetered per-batch memory overheads, I had a new thought: it should
> >also be triggered whenever doubling the number of batches would halve
> >the amount of memory left for the hash table (after including the size
> >of all those BufFile objects in the computation as Tomas proposes). I
> >think that might be exactly the right cut-off if you want to do
> >as much Grace partitioning as your work_mem can afford, and therefore
> >as little looping as possible to complete the join while respecting
> >work_mem.
> >
>
> Not sure what NLJ flag rule you propose, exactly.
>
> Regarding the threshold value - once the space for BufFiles (and other
> overhead) gets over work_mem/2, it does not make any sense to increase
> the number of batches because then the work_mem would be entirely
> occupied by BufFiles.
>
> The WIP patches don't actually do exactly that though - they just check
> if the incremented size would be over work_mem/2. I think we should
> instead allow up to work_mem*2/3, i.e. stop adding batches after the
> BufFiles start consuming more than work_mem/3 memory.
>
> I think that's actually what you mean by "halving the amount of memory
> left for the hash table" because that's what happens after reaching the
> work_mem/3.

Well, instead of an arbitrary number like work_mem/2 or work_mem *
2/3, I was trying to figure out the precise threshold beyond which it
doesn't make sense to expend more memory on BufFile objects, even if
the keys are uniformly distributed so that splitting batches halves
the expected tuple count per batch. Let work_mem_for_hash_table =
work_mem - nbatch * sizeof(BufFile). Whenever you increase nbatch,
work_mem_for_hash_table goes down, but it had better be more than half
what it was before, or we expect to run out of memory again (if the
batch didn't fit before, and we're now splitting it so that we'll try
to load only half of it, we'd better have more than half the hash
table budget we had before). Otherwise you'd be making matters worse,
and this process probably won't terminate.
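
To make that concrete, here is a hypothetical sketch of the test in C
(the function and parameter names are mine, purely for illustration):

static bool
doubling_nbatch_is_worthwhile(size_t work_mem, size_t nbatch,
                              size_t buffile_size)
{
    /* Memory eaten by BufFiles before and after doubling nbatch. */
    size_t  old_overhead = nbatch * buffile_size;
    size_t  new_overhead = 2 * nbatch * buffile_size;

    if (new_overhead >= work_mem)
        return false;   /* BufFiles alone would exceed work_mem */

    /*
     * Require that the hash table budget after doubling is more than
     * half of what it was before:
     *
     *     work_mem - 2n*s > (work_mem - n*s) / 2
     *
     * which rearranges to work_mem > 3n*s, i.e. the BufFile overhead
     * must stay under work_mem / 3.
     */
    return (work_mem - new_overhead) > (work_mem - old_overhead) / 2;
}

Note that solving the inequality gives nbatch * sizeof(BufFile) <
work_mem / 3, so this "halving" rule does appear to coincide with the
work_mem/3 cut-off you describe.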

> But I think that rule is irrelevant here, really, because this thread
> was discussing cases where adding batches is futile due to skew, no? In
> which case we should stop adding batches after reaching some % of tuples
> not moving from the batch.

Yeah, this thread started off just about the 95% thing, but veered off
course since these topics are tangled up. Sorry.

> Or are you suggesting we should remove that rule, and instead rely on
> this rule about halving the hash table space? That might work too, I
> guess.

No, I suspect you need both rules. We still want to detect extreme
skew as soon as possible, even though the other rule will eventually
fire; might as well do it sooner in clear-cut cases.
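
For example, the skew test could be as simple as this sketch (the 95%
threshold is the one discussed earlier in the thread; the names are
again just illustrative):

static bool
batch_split_was_futile(size_t tuples_before, size_t tuples_kept)
{
    /*
     * If at least 95% of the tuples hashed back into the batch we
     * just split, the keys are extremely skewed and further splitting
     * is futile, so stop doubling nbatch and fall back to the other
     * strategy.
     */
    return (double) tuples_kept >= 0.95 * (double) tuples_before;
}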

> OTOH I'm not sure it's a good idea to handle both those cases the same
> way - "overflow file" idea works pretty well for cases where the hash
> table actually can be split into batches, and I'm afraid NLJ will be
> much less efficient for those cases.

Yeah, you might be right about that, and everything I'm describing is
pure vapourware anyway. But your overflow file scheme isn't exactly
free of IO-amplification and multiple-processing of input data
either... and I haven't yet grokked how it would work for parallel
hash. Parallel hash generally doesn't have the
'throw-the-tuples-forward' concept, which is inherently based on
sequential in-order processing of batches.

--
Thomas Munro
https://enterprisedb.com
