Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2019-05-19 23:07:03
Message-ID: CA+hUKGKPHVOLeROop7OTdJ=Ty0vC=mk6QExfxcRWBMLHFLH_PQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, May 18, 2019 at 12:15 PM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
> On Thu, May 16, 2019 at 3:22 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>> Admittedly I don't have a patch, just a bunch of handwaving. One
>> reason I haven't attempted to write it is because although I know how
>> to do the non-parallel version using a BufFile full of match bits in
>> sync with the tuples for outer joins, I haven't figured out how to do
>> it for parallel-aware hash join, because then each loop over the outer
>> batch could see different tuples in each participant. You could use
>> the match bit in HashJoinTuple header, but then you'd have to write
>> all the tuples out again, which is more IO than I want to do. I'll
>> probably start another thread about that.
>
> Could you explain more about the implementation you are suggesting?
>
> Specifically, what do you mean "BufFile full of match bits in sync with the
> tuples for outer joins?"

First let me restate the PostgreSQL terminology for this stuff so I
don't get confused while talking about it:

* The inner side of the join = the right side = the side we use to
build a hash table. Right and full joins emit inner tuples when there
is no matching tuple on the outer side.

* The outer side of the join = the left side = the side we use to
probe the hash table. Left and full joins emit outer tuples when
there is no matching tuple on the inner side.

* Semi joins emit exactly one instance of each outer tuple that has at
least one match on the inner side; anti joins emit each outer tuple
that has no match on the inner side.

We have a couple of relatively easy cases:

* Inner joins: for every outer tuple, we try to find a match in the
hash table, and if we find one we emit a tuple. To add looping
support, if we run out of memory when loading the hash table we can
just proceed to probe the fragment we've managed to load so far, and
then rewind the outer batch, clear the hash table and load in the next
work_mem-sized fragment and do it again... rinse and repeat until
we've eventually processed the whole inner batch (see the toy sketch
after this list). After we've finished looping, we move on to the
next batch.

* For right and full joins ("HJ_FILL_INNER"), we also need to emit an
inner tuple for every tuple that was loaded into the hash table but
never matched. That's done using a flag HEAP_TUPLE_HAS_MATCH in the
header of the tuples of the hash table, and a scan through the whole
hash table at the end of each batch to look for unmatched tuples
(ExecScanHashTableForUnmatched()). To add looping support, that just
has to be done at the end of every inner batch fragment, that is,
after every loop.

And now for the cases that need a new kind of match bit, as far as I can see:

* For left and full joins ("HJ_FILL_OUTER"), we also need to emit an
outer tuple for every tuple that didn't find a match in the hash
table. Normally that is done while probing, without any need for
memory or match flags: if we don't find a match, we just spit out an
outer tuple immediately. But that simple strategy won't work if the
hash table holds only part of the inner batch. Since we'll be
rewinding and looping over the outer batch again for the next inner
batch fragment, we can't yet say whether a given outer tuple will find
a match in a later loop, and the later loops can't know on their own
whether it already matched in an earlier one. So we need some kind of
cumulative memory between loops, and we only know which outer tuples
have a match after we've finished all loops. There would then need to
be a new function ExecScanOuterBatchForUnmatched().

* For semi joins, we need to emit exactly one outer tuple whenever
there is at least one match on the inner side. To add looping support,
we need to make sure that we don't emit an extra copy of the outer
tuple if there is a second match in another inner batch fragment.
Again, this implies some kind of memory between loops, so we can
suppress later matches.

* For anti joins, we need to emit an outer tuple whenever there is no
match. To add looping support, we need to wait until we've seen all
the inner batch fragments before we know that a given outer tuple has
no match, perhaps with the same new function
ExecScanOuterBatchForUnmatched().

So, we need some kind of inter-loop memory, but we obviously don't
want to create another source of unmetered RAM gobbling. So one idea
is a BufFile that has one bit per outer tuple in the batch. In the
first loop, we just stream out the match results as we go, and then
somehow we OR the bitmap with the match results in subsequent loops.
After the last loop, the bitmap tells us which outer tuples never
matched -- just scan it in lock-step with the outer batch and look for
0 bits.
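
Here's a toy standalone sketch of that cumulative-bitmap idea, with an
in-memory byte array standing in for the per-batch BufFile of match
bits (again just an illustration of the logic I have in mind, not real
code):

/* Toy model of cumulative match bits for left/anti/semi joins. */
#include <stdio.h>
#include <string.h>

int
main(void)
{
    int  outer[] = {2, 3, 4, 5};    /* join keys of one outer batch */
    int  nouter = 4;
    char matched[4];                /* one "bit" per outer tuple, in sync
                                     * with the outer batch */

    /* Two inner fragments, as if the inner batch needed two loops. */
    int  frag1[] = {2, 7};
    int  frag2[] = {5, 9};
    int *frags[] = {frag1, frag2};
    int  fraglens[] = {2, 2};

    memset(matched, 0, sizeof(matched));

    for (int loop = 0; loop < 2; loop++)
        for (int o = 0; o < nouter; o++)
            for (int i = 0; i < fraglens[loop]; i++)
                if (outer[o] == frags[loop][i])
                    matched[o] = 1;     /* OR in this loop's results */

    /* After the last loop: the ExecScanOuterBatchForUnmatched() step. */
    for (int o = 0; o < nouter; o++)
        if (!matched[o])
            printf("emit unmatched outer tuple with key %d\n", outer[o]);
    return 0;
}

For a left or full join you'd emit those unmatched tuples null-extended,
for an anti join you'd emit them as-is, and for a semi join you'd
instead use the array to suppress duplicate emissions during probing.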

Unfortunately that bits-in-order scheme doesn't work for parallel
hash, where the SharedTuplestore tuples seen by each worker are
non-deterministic. So perhaps in that case we could use the
HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
the whole outer batch back out each time through the loop. That'd
keep the tuples and match bits together, but it seems like a lot of
IO... Note that parallel hash doesn't support right/full joins today,
because of some complications about waiting and deadlocks that might
turn out to be relevant here too, and might be solvable (I should
probably write about that in another email). But left joins *are*
supported today, so they would need to be desupported if we wanted to
add a loop-based escape valve without solving these problems. That
doesn't seem acceptable, which is why I'm a bit stuck on this point,
and unfortunately it may be a while before I have time to tackle any
of that personally.

> Is the implementation you are thinking of one which falls back to NLJ on a
> batch-by-batch basis decided during the build phase?

Yeah.

> If so, why do you need to keep track of the outer tuples seen?
> If you are going to loop through the whole outer side for each tuple on the
> inner side, it seems like you wouldn't need to.

The idea is to loop through the whole outer batch for every
work_mem-sized inner batch fragment, not for every tuple. Though in
theory a fragment could be as small as a single tuple.

> Could you make an outer "batch" which is the whole of the outer relation? That
> is, could you do something like: when hashing the inner side, if re-partitioning
> is resulting in batches that will overflow spaceAllowed, could you set a flag on
> that batch use_NLJ and when making batches for the outer side, make one "batch"
> that has all the tuples from the outer side which the inner side batch which was
> flagged will do NLJ with.

I didn't understand this... you always need to make one outer batch
corresponding to every inner batch. The problem is the tricky
left/full/anti/semi join cases when joining against fragments holding
less than the full inner batch: we still need some way to implement
join logic that depends on knowing whether there is a match in *any*
of the inner fragments/loops.

About the question of when exactly to set the "use_NLJ" flag: I had
originally been thinking of this only as a way to deal with the
extreme skew problem. But in light of Tomas's complaints about
unmetered per-batch memory overheads, I had a new thought: it should
also be triggered whenever doubling the number of batches would halve
the amount of memory left for the hash table (after including the size
of all those BufFile objects in the computation as Tomas proposes). I
think that might be exactly the right cut-off if you want to do
as much Grace partitioning as your work_mem can afford, and therefore
as little looping as possible to complete the join while respecting
work_mem.
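
To illustrate the arithmetic I have in mind (the 8kB per-BufFile
figure, counting one file per inner and one per outer batch, and the
helper names are all just assumptions for the sketch):

/* Rough sketch of the proposed cut-off: stop doubling nbatch (and fall
 * back to looping) once doubling would halve, or worse, the memory left
 * for the hash table after BufFile buffers are accounted for. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BUFFILE_OVERHEAD 8192       /* assumed per-BufFile buffer size */

static size_t
space_for_hashtable(size_t work_mem, int nbatch)
{
    /* one inner and one outer BufFile per batch */
    size_t overhead = (size_t) 2 * nbatch * BUFFILE_OVERHEAD;

    return overhead >= work_mem ? 0 : work_mem - overhead;
}

static bool
should_stop_doubling(size_t work_mem, int nbatch)
{
    return space_for_hashtable(work_mem, nbatch * 2) <=
           space_for_hashtable(work_mem, nbatch) / 2;
}

int
main(void)
{
    size_t work_mem = 4 * 1024 * 1024;      /* 4MB */

    for (int nbatch = 1; nbatch <= 256; nbatch *= 2)
        printf("nbatch=%3d  space left=%7zu  stop doubling? %s\n",
               nbatch, space_for_hashtable(work_mem, nbatch),
               should_stop_doubling(work_mem, nbatch) ? "yes" : "no");
    return 0;
}

With 4MB of work_mem that check trips around nbatch = 128, i.e. the
point where the BufFile buffers alone eat half the budget, which is
roughly the behaviour I'd want from such a cut-off.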

--
Thomas Munro
https://enterprisedb.com
