Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesse Zhang <sbjesse(at)gmail(dot)com>, dkimura(at)pivotal(dot)io, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, david(dot)g(dot)kimura(at)gmail(dot)com
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2020-06-26 00:22:32
Message-ID: 20200626002232.33jhfr3s5sy3dmzx@development
Lists: pgsql-hackers

On Thu, Jun 25, 2020 at 03:09:44PM -0700, Melanie Plageman wrote:
>On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
>wrote:
>
>> I started looking at the patch to refresh my knowledge both of this
>> patch and parallel hash join, but I think it needs a rebase. The
>> changes in 7897e3bb90 apparently touched some of the code.
>
>
>Thanks so much for the review, Tomas!
>
>I've attached a rebased patch which also contains updates discussed
>below.
>

Thanks.

>
>> I assume
>> you're working on a patch addressing the remaining TODOS, right?
>>
>
>I wanted to get some feedback on the patch before working through the
>TODOs to make sure I was on the right track.
>
>Now that you are reviewing this, I will focus all my attention
>on addressing your feedback. If there are any TODOs that you feel are
>most important, let me know, so I can start with those.
>
>Otherwise, I will prioritize parallel batch 0 spilling.

Feel free to work on the batch 0 spilling, please. I still need to get
familiar with various parts of the parallel hash join etc., so I don't
have any immediate feedback on which TODOs to work on first.

>David Kimura plans to do a bit of work on parallel hash join batch 0
>spilling tomorrow. Whatever is left after that, I will pick up next
>week. Parallel hash join batch 0 spilling is the last large TODO that I
>had.
>
>My plan was then to focus on the feedback I get from you and anyone
>else who reviews this (either about which TODOs are most important, or
>about issues outside of the TODOs I've identified).
>

OK.

>>
>> I see you've switched to "stripe" naming - I find that a bit confusing,
>> because when I hear stripe I think about RAID, where it means pieces of
>> data interleaved and stored on different devices. But maybe that's just
>> me and it's a good name. Maybe it'd be better to keep the naming and
>> only tweak it at the end, not to disrupt reviews unnecessarily.
>>
>
>I hear you about "stripe". I still quite like it, especially as compared
>to its predecessor (originally, I called them chunks -- which is
>impossible given that SharedTuplestoreChunks are a thing).
>

I don't think using chunks in one place means we can't use it elsewhere
in a different context. I'm sure we have "chunks" in other places. But
let's not bikeshed on this too much.

>For ease of review, as you mentioned, I will keep the name for now. I am
>open to changing it later, though.
>
>I've been soliciting ideas for alternatives and, so far, folks have
>suggested "stride", "step", "flock", "herd", "cohort", and "school". I'm
>still on team "stripe" though, as it stands.
>

;-)

>
>>
>> nodeHash.c
>> ----------
>>
>>
>> 1) MultiExecPrivateHash says this
>>
>> /*
>> * Not subject to skew optimization, so either insert normally
>> * or save to batch file if it belongs to another stripe
>> */
>>
>> I wonder what it means to "belong to another stripe". I understand what
>> that means for batches, which are identified by batchno computed from
>> the hash value. But I thought "stripes" are just work_mem-sized pieces
>> of a batch, so I don't quite understand this. Especially when the code
>> does not actually check "which stripe" the row belongs to.
>>
>>
>I agree this was confusing.
>
>"belongs to another stripe" meant here that if batch 0 falls back and we
>are still loading it, once we've filled up work_mem, we need to start
>saving those tuples to a spill file for batch 0. I've changed the
>comment to this:
>
>- * or save to batch file if it belongs to another stripe
>+ * or save to batch file if batch 0 falls back and we have
>+ * already filled the hashtable up to space_allowed.
>

OK. Silly question - what does "batch 0 falls back" mean? Does it mean
that we realized the hash table for batch 0 would not fit into work_mem,
so we switched to the "hashloop" strategy?

>
>> 2) I find the fields hashloop_fallback rather confusing. We have one in
>> HashJoinTable (and it's an array of BufFile items) and another one in
>> ParallelHashJoinBatch (this time just bool).
>>
>> I think the HashJoinTable one should be renamed to hashloopBatchFile
>> (similarly to the other BufFile arrays).
>
>
>I think you are right about the name. I've changed the name in
>HashJoinTableData to hashloopBatchFile.
>
>The array of BufFiles hashloop_fallback was only used by serial
>hashjoin. The boolean hashloop_fallback variable is used only by
>parallel hashjoin.
>
>The reason I had them named the same thing is that I thought it would be
>nice to have a variable with the same name to indicate if a batch "fell
>back" for both parallel and serial hashjoin--especially since we check
>it in the main hashjoin state machine used by parallel and serial
>hashjoin.
>
>In serial hashjoin, the BufFiles aren't identified by name, so I kept
>them in that array. In parallel hashjoin, each ParallelHashJoinBatch has
>the status saved (in the struct).
>So, both represented the fallback status of a batch.
>
>However, I agree with you, so I've renamed the serial one to
>hashloopBatchFile.
>

OK

>>
>> Although I'm not sure why we even need
>> this file, when we have innerBatchFile? BufFile(s) are not exactly free,
>> in fact it's one of the problems for hashjoins with many batches.
>>
>>
>Interesting -- it didn't even occur to me to combine the bitmap with the
>inner side batch file data.
>It definitely seems like a good idea to save the BufFile given that so
>little data will likely go in it and that it has a 1-1 relationship with
>inner side batches.
>
>How might it work? Would you reserve some space at the beginning of the
>file? When would you reserve the bytes? (Before adding tuples you won't
>know how many bytes you need, so it might be hard to make sure there is
>enough space.) Would all inner side files have space reserved or just
>fallback batches?
>

Oh! So the hashloopBatchFile is only used for the bitmap? I hadn't
realized that. In that case it probably makes sense to keep it separate
from the files with spilled tuples; interleaving that somehow would be
way too complex, I think.

However, do we need an array of those files? I thought we only need the
bitmap until we process all rows from each "stripe" and then we can
throw it away, right? Which would also mean we don't need to worry about
the memory usage too much, because the 8kB buffer will go away after
calling BufFileClose.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
