Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2019-05-20 14:31:52
Message-ID: 20190520143152.77nrjewyd3mbsbyj@development
Lists: pgsql-hackers

On Mon, May 20, 2019 at 01:25:52PM +1200, Thomas Munro wrote:
>On Mon, May 20, 2019 at 12:22 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On Mon, May 20, 2019 at 11:07:03AM +1200, Thomas Munro wrote:
>> >First let me restate the PostgreSQL terminology for this stuff so I
>> >don't get confused while talking about it:
>> >
>> >* The inner side of the join = the right side = the side we use to
>> >build a hash table. Right and full joins emit inner tuples when there
>> >is no matching tuple on the outer side.
>> >
>> >* The outer side of the join = the left side = the side we use to
>> >probe the hash table. Left and full joins emit outer tuples when
>> >there is no matching tuple on the inner side.
>> >
>> >* Semi and anti joins emit exactly one instance of each outer tuple if
>> >there is/isn't at least one match on the inner side.
>> >
>> I think you're conflating inner/outer side and left/right, or rather
>> assuming it's always left=inner and right=outer.
>In PostgreSQL, it's always inner = right, outer = left. You can see
>that reflected in plannodes.h and elsewhere:
>/* ----------------
> * these are defined to avoid confusion problems with "left"
> * and "right" and "inner" and "outer". The convention is that
> * the "left" plan is the "outer" plan and the "right" plan is
> * the inner plan, but these make the code more readable.
> * ----------------
> */
>#define innerPlan(node) (((Plan *)(node))->righttree)
>#define outerPlan(node) (((Plan *)(node))->lefttree)
>I'm not sure why you think it's not always like that: are you referring to
>the fact that the planner can choose to reverse the join (compared to
>the SQL LEFT|RIGHT JOIN that appeared in the query), creating an extra
>layer of confusion? In my email I was talking only about left and
>right as seen by the executor.

It might be my lack of understanding, but I'm not sure how we map
LEFT/RIGHT JOIN to lefttree/righttree and inner/outer at the plan level.
My assumption was that for "a LEFT JOIN b" either "a" or "b" can end up
as the inner or outer (sub)tree.

But I haven't checked, so I may easily be wrong. Maybe the comment you
quoted clarifies that; I'm not sure.

>> >About the question of when exactly to set the "use_NLJ" flag: I had
>> >originally been thinking of this only as a way to deal with the
>> >extreme skew problem. But in light of Tomas's complaints about
>> >unmetered per-batch memory overheads, I had a new thought: it should
>> >also be triggered whenever doubling the number of batches would halve
>> >the amount of memory left for the hash table (after including the size
>> >of all those BufFile objects in the computation as Tomas proposes). I
>think that might be exactly the right cut-off if you want to do
>> >as much Grace partitioning as your work_mem can afford, and therefore
>> >as little looping as possible to complete the join while respecting
>> >work_mem.
>> >
>> Not sure what NLJ flag rule you propose, exactly.
>> Regarding the threshold value - once the space for BufFiles (and other
>> overhead) gets over work_mem/2, it does not make any sense to increase
>> the number of batches because then the work_mem would be entirely
>> occupied by BufFiles.
>> The WIP patches don't actually do exactly that though - they just check
>> if the incremented size would be over work_mem/2. I think we should
>> instead allow up to work_mem*2/3, i.e. stop adding batches after the
>> BufFiles start consuming more than work_mem/3 memory.
>> I think that's actually what you mean by "halving the amount of memory
>> left for the hash table" because that's what happens after reaching the
>> work_mem/3.
>Well, instead of an arbitrary number like work_mem/2 or work_mem *
>2/3, I was trying to figure out the precise threshold beyond which it
>doesn't make sense to expend more memory on BufFile objects, even if
>the keys are uniformly distributed so that splitting batches halves
>the expected tuple count per batch. Let work_mem_for_hash_table =
>work_mem - nbatch * sizeof(BufFile). Whenever you increase nbatch,
>work_mem_for_hash_table goes down, but it had better be more than half
>what it was before, or we expect to run out of memory again (if the
>batch didn't fit before, and we're now splitting it so that we'll try
>to load only half of it, we'd better have more than half the budget
>for the hash table than we had before). Otherwise you'd be making
>matters worse, and this process probably won't terminate.

But the work_mem/3 does exactly that.

Let's say the BufFiles need a bit less than work_mem/3. That means we have
a bit more than 2*work_mem/3 for the hash table. If you double the number
of batches, the BufFiles grow to a bit less than 2*work_mem/3, leaving a
bit more than work_mem/3 for the hash table. That is, we've not halved the
hash table size.

If BufFiles need a bit more memory than work_mem/3, then after doubling
the number of batches we'll end up with less than half the initial hash
table space.

So I think work_mem/3 is the threshold we're looking for.

>> But I think that rule is irrelevant here, really, because this thread
>> was discussing cases where adding batches is futile due to skew, no? In
>> which case we should stop adding batches after reaching some % of tuples
>> not moving from the batch.
>Yeah, this thread started off just about the 95% thing, but veered off
>course since these topics are tangled up. Sorry.
>> Or are you suggesting we should remove that rule, and instead rely on
>> this rule about halving the hash table space? That might work too, I
>> guess.
>No, I suspect you need both rules. We still want to detect extreme
>skew as soon as possible, even though the other rule will eventually
>fire; might as well do it sooner in clear-cut cases.

Right, I agree. I think we need the 95% rule (or whatever the threshold
ends up being) to handle the cases with skew / many duplicates, and then
the overflow files (or some other solution) to handle underestimates with
uniform distributions.

>> OTOH I'm not sure it's a good idea to handle both those cases the same
>> way - "overflow file" idea works pretty well for cases where the hash
>> table actually can be split into batches, and I'm afraid NLJ will be
>> much less efficient for those cases.
>Yeah, you might be right about that, and everything I'm describing is
>pure vapourware anyway. But your overflow file scheme isn't exactly
>free of IO-amplification and multiple-processing of input data
>either... and I haven't yet grokked how it would work for parallel
>hash. Parallel hash generally doesn't have the
>'throw-the-tuples-forward' concept, which is inherently based on
>sequential in-order processing of batches.

Sure, let's do some math.

With the overflow scheme, the amplification is roughly ~2x (relative to
master), because we need to write data for most batches first into the
overflow file and then into the correct one. Master has write amplification
of about ~1.25x (due to the gradual increase in the number of batches), so
the "total" amplification is ~2.5x.

For the NLJ, the amplification depends entirely on what fraction of the
hash table fits into work_mem. For example, when it needs to be split into
32 fragments, we get ~32x amplification. It might affect just some batches,
of course.

So I still think those approaches are complementary and we need both.


Tomas Vondra
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
