Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesse Zhang <sbjesse(at)gmail(dot)com>, dkimura(at)pivotal(dot)io, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, david(dot)g(dot)kimura(at)gmail(dot)com
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2020-06-25 22:09:44
Message-ID: CAAKRu_Z9umj-FNW2ta0yHzhK1LHMBsDFM_CFevHH78=xGiaqBg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> I started looking at the patch to refresh my knowledge both of this
> patch and parallel hash join, but I think it needs a rebase. The
> changes in 7897e3bb90 apparently touched some of the code.

Thanks so much for the review, Tomas!

I've attached a rebased patch which also contains updates discussed
below.

> I assume
> you're working on a patch addressing the remaining TODOS, right?
>

I wanted to get some feedback on the patch before working through the
TODOs to make sure I was on the right track.
Now that you are reviewing this, I will focus all my attention
on addressing your feedback. If there are any TODOs that you feel are
most important, let me know, so I can start with those. Otherwise, I
will prioritize parallel batch 0 spilling.

I wanted to get some feedback on the patch before working through the
TODOs to make sure I was on the right track.

Now that you are reviewing this, I will focus all my attention
on addressing your feedback. If there are any TODOs that you feel are
most important, let me know, so I can start with those.

Otherwise, I will prioritize parallel batch 0 spilling.
David Kimura plans to do a bit of work on on parallel hash join batch 0
spilling tomorrow. Whatever is left after that, I will pick up next
week. Parallel hash join batch 0 spilling is the last large TODO that I
had.

My plan was to then focus on the feedback (either about which TODOs are
most important or outside of the TODOs I've identified) I get from you
and anyone else who reviews this.

>
> I see you've switched to "stripe" naming - I find that a bit confusing,
> because when I hear stripe I think about RAID, where it means pieces of
> data interleaved and stored on different devices. But maybe that's just
> me and it's a good name. Maybe it'd be better to keep the naming and
> only tweak it at the end, not to disrupt reviews unnecessarily.
>

I hear you about "stripe". I still quite like it, especially as compared
to its predecessor (originally, I called them chunks -- which is
impossible given that SharedTuplestoreChunks are a thing).

For ease of review, as you mentioned, I will keep the name for now. I am
open to changing it later, though.

I've been soliciting ideas for alternatives and, so far, folks have
suggested "stride", "step", "flock", "herd", "cohort", and "school". I'm
still on team "stripe" though, as it stands.

>
> nodeHash.c
> ----------
>
>
> 1) MultiExecPrivateHash says this
>
> /*
> * Not subject to skew optimization, so either insert normally
> * or save to batch file if it belongs to another stripe
> */
>
> I wonder what it means to "belong to another stripe". I understand what
> that means for batches, which are identified by batchno computed from
> the hash value. But I thought "stripes" are just work_mem-sized pieces
> of a batch, so I don't quite understand this. Especially when the code
> does not actually check "which stripe" the row belongs to.
>
>
I agree this was confusing.

"belongs to another stripe" meant here that if batch 0 falls back and we
are still loading it, once we've filled up work_mem, we need to start
saving those tuples to a spill file for batch 0. I've changed the
comment to this:

- * or save to batch file if it belongs to another stripe
+ * or save to batch file if batch 0 falls back and we have
+ * already filled the hashtable up to space_allowed.

> 2) I find the fields hashloop_fallback rather confusing. We have one in
> HashJoinTable (and it's array of BufFile items) and another one in
> ParallelHashJoinBatch (this time just bool).
>
> I think HashJoinTable should be renamed to hashloopBatchFile (similarly
> to the other BufFile arrays).

I think you are right about the name. I've changed the name in
HashJoinTableData to hashloopBatchFile.

The array of BufFiles hashloop_fallback was only used by serial
hashjoin. The boolean hashloop_fallback variable is used only by
parallel hashjoin.

The reason I had them named the same thing is that I thought it would be
nice to have a variable with the same name to indicate if a batch "fell
back" for both parallel and serial hashjoin--especially since we check
it in the main hashjoin state machine used by parallel and serial
hashjoin.

In serial hashjoin, the BufFiles aren't identified by name, so I kept
them in that array. In parallel hashjoin, each ParallelHashJoinBatch has
the status saved (in the struct).
So, both represented the fall back status of a batch.

However, I agree with you, so I've renamed the serial one to
hashloopBatchFile.

>
> Although I'm not sure why we even need
> this file, when we have innerBatchFile? BufFile(s) are not exactly free,
> in fact it's one of the problems for hashjoins with many batches.
>
>
Interesting -- it didn't even occur to me to combine the bitmap with the
inner side batch file data.
It definitely seems like a good idea to save the BufFile given that so
little data will likely go in it and that it has a 1-1 relationship with
inner side batches.

How might it work? Would you reserve some space at the beginning of the
file? When would you reserve the bytes (before adding tuples you won't
know how many bytes you need, so it might be hard to make sure there is
enough space.) Would all inner side files have space reserved or just
fallback batches?

--
Melanie Plageman

Attachment Content-Type Size
v10-0001-Implement-Adaptive-Hashjoin.patch text/x-patch 170.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2020-06-25 22:23:11 Re: create database with template doesn't copy database ACL
Previous Message Jeff Davis 2020-06-25 21:28:02 Re: Default setting for enable_hashagg_disk