Re: accounting for memory used for BufFile during hash joins

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: accounting for memory used for BufFile during hash joins
Date: 2019-05-07 03:15:11
Message-ID: 20190507031511.lvp6bqn2e5dpsnga@development
Lists: pgsql-hackers

On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>On Tue, May 7, 2019 at 9:58 AM Melanie Plageman
><melanieplageman(at)gmail(dot)com> wrote:
>> On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> The second patch tries to enforce work_mem more strictly. That would be
>>> impossible if we were to keep all the BufFile structs in memory, so
>>> instead it slices the batches into chunks that fit into work_mem, and
>>> then uses a single "overflow" file for slices currently not in memory.
>>> These extra slices can't be counted into work_mem, but we should need
>>> just very few of them. For example with work_mem=4MB the slice is 128
>>> batches, so we need 128x fewer overflow files (compared to per-batch).
>>>
>> I want to see if I understand the implications of the per-slice-overflow patch
>> for execution of hashjoin:
>> For each bucket in the hashtable, when attempting to double the number of
>> batches, if the memory that the BufFile structs will occupy once this is done
>> will exceed the work_mem, split each batch into slices that fit into memory.
>> This means that, for each probe-side tuple hashing to that bucket, you have to
>> load every slice of each batch separately into memory to ensure correct results.
>> Is this right?
>

>Seems expensive for large numbers of slices -- you need to join the
>outer batch against each inner slice.

Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.

It does slightly increase the amount of data we need to shuffle between
the temp files, because we can't write the data directly to batches in
"future" slices. But that amplification is capped to ~2.2x (compared to
the ~1.4x in master) - I've shared some measurements in [1].

[1] https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
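
To make that a bit more concrete, here is a toy sketch of the routing
logic (made-up constants and names, not the actual patch code) - only
batches in the current slice keep their own BufFile, and a tuple that
belongs to a batch in a later slice gets appended to that slice's
overflow file instead, which is where the extra shuffling comes from:

#include <stdio.h>

/* Toy illustration only - made-up constants, not the actual patch code. */
#define WORK_MEM_BYTES     (4 * 1024 * 1024)   /* work_mem = 4MB */
#define PER_BATCH_OVERHEAD (32 * 1024)         /* assumed BufFile cost per batch */

int
main(void)
{
    int slice_size = WORK_MEM_BYTES / PER_BATCH_OVERHEAD;  /* 128 batches per slice */
    int cur_slice  = 0;
    int batchno    = 500;       /* batch some tuple hashes into */

    if (batchno / slice_size == cur_slice)
        printf("tuple goes to per-batch file %d\n", batchno);
    else
        printf("tuple goes to overflow file for slice %d\n",
               batchno / slice_size);

    return 0;
}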

>But I wonder how we'd deal with outer joins, as Tom Lane asked in
>another thread:
>
>https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>

That seems unrelated - we slice the array of batches to keep the memory
needed for the BufFiles under control. The hash table itself remains
intact, so there's no issue with outer joins.
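
Back-of-the-envelope arithmetic of what the slicing buys us (the ~32kB
per-batch figure is made up, picked only so that a 4MB work_mem gives
the 128-batch slice from the example above):

#include <stdio.h>

/* Rough comparison, assuming ~32kB of BufFile-related memory per batch
 * (made-up figure, just for illustration). */
int
main(void)
{
    long long per_batch = 32LL * 1024;
    long long work_mem  = 4LL * 1024 * 1024;
    long long slice     = work_mem / per_batch;         /* 128 batches per slice */

    for (long long nbatch = 256; nbatch <= 65536; nbatch *= 16)
    {
        long long master = nbatch * per_batch;              /* every batch keeps a BufFile */
        long long sliced = slice * per_batch                /* current slice only */
                         + (nbatch / slice) * per_batch;    /* plus one overflow file per slice */

        printf("nbatch = %6lld   master = %5lld MB   sliced = %3lld MB\n",
               nbatch, master / (1024 * 1024), sliced / (1024 * 1024));
    }

    return 0;
}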

>>> I'm not entirely sure which of those approaches is the right one. The
>>> first one is clearly just a "damage control" for cases where the hash
>>> side turned out to be much larger than we expected. With good estimates
>>> we probably would not have picked a hash join for those (that is, we
>>> should have realized we can't keep work_mem and prohibit hash join).
>>>
>>> The second patch however makes hash join viable for some of those cases,
>>> and it seems to work pretty well (there are some numbers in the message
>>> posted to pgsql-performance thread). So I kinda like this second one.
>>>
>> So, my initial reaction after taking a look at the patches is that I prefer the
>> first approach--increasing the resize threshhold. The second patch, the
>> per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
>> address what is, based on my understanding, an edge case.
>
>Personally I'd like to make work_mem more reliable, even if it takes a
>major new mechanism.
>

Yeah, I share that attitude.

>Stepping back a bit, I think there is something fishy about the way we
>detect extreme skew. Is that a factor in this case? Right now we
>wait until we have a batch that gets split into child batches
>containing exactly 0% and 100% of the tuples before we give up.
>Previously I had thought of that as merely a waste of time, but
>clearly it's also a waste of unmetered memory. Oops.
>

Yes, that was a factor in the reported query - the data set contained a
significant number of duplicate values (~10%), but it took a while to
disable growth because there always happened to be a couple of rows with
a different value.

>I think our extreme skew detector should go off sooner, because
>otherwise if you have N nicely distributed unique keys and also M
>duplicates of one bad egg key that'll never fit in memory, we keep
>repartitioning until none of the N keys fall into the batch containing
>the key for the M duplicates before we give up! You can use
>balls-into-bins maths to figure out the number, but I think that means
>we expect to keep splitting until we have N * some_constant batches,
>and that's just silly and liable to create massive numbers of
>partitions proportional to N, even though we're trying to solve a
>problem with M. In another thread I suggested we should stop when
>(say) 95% of the tuples go to one child batch. I'm not sure how you
>pick the number.
>

I agree we should relax the 0%/100% split condition and disable the
growth sooner. But I think we should also re-evaluate that decision
after a while - the data set may be correlated in some way, in which
case we might have disabled the growth prematurely. Increasing the
number of batches may not reduce memory usage now, but it may help in
the future.

That's already an issue today, but it would become even more likely if
we disabled growth earlier, e.g. at a 5%/95% split.
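
Roughly what I have in mind is something like this (a toy sketch, the
95% threshold and the numbers are made up):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical relaxed check: give up on batch growth when a split sends
 * more than SKEW_LIMIT of the tuples to a single child batch, instead of
 * requiring a perfect 0%/100% split. The batch would be remembered, so
 * that growth can be re-enabled later if the rest of the input turns out
 * not to be skewed (e.g. because the data was merely correlated). */
#define SKEW_LIMIT 0.95

static bool
split_is_useless(long kept, long moved)
{
    double frac = (double) kept / (double) (kept + moved);

    return (frac > SKEW_LIMIT || frac < (1.0 - SKEW_LIMIT));
}

int
main(void)
{
    printf("disable growth: %d\n", split_is_useless(96000, 4000));   /* 96% kept -> 1 */
    printf("disable growth: %d\n", split_is_useless(60000, 40000));  /* 60% kept -> 0 */
    return 0;
}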

FWIW I believe this is a mostly orthogonal issue to what's discussed in
this thread.

>Of course that doesn't solve the problem that we don't have a better
>plan for dealing with the M duplicates -- it just avoids needless
>batch explosions triggered by bad maths. I think we need something
>like Tomas's #2, or a way to switch to sort-merge, or some other
>scheme. I'm not sure how to compare the slice idea, which involves
>processing outer tuples * inner slices with the sort-merge idea, which
>involves sorting the inner and outer batch, plus the entirely new
>concept of switching to another node at execution time.
>

Do we actually check how many duplicates there are during planning? I
wonder if we could penalize (or even disable) hash joins when there are
too many duplicates to fit into work_mem. Of course, that's going to be
tricky with filtering, and so on.
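
Purely as a hypothetical sketch of the kind of planner-side check I mean
(made-up numbers, none of this is existing code):

#include <stdbool.h>
#include <stdio.h>

/* If the most common key is frequent enough that its group alone cannot
 * fit into work_mem, a hash join is likely to misbehave no matter how
 * many batches we split the input into. */
int
main(void)
{
    double rows          = 50e6;               /* estimated inner rows */
    double mcv_frequency = 0.10;               /* ~10% duplicates, as in the report */
    double tuple_width   = 64;                 /* bytes per hashed tuple, made up */
    double work_mem      = 4.0 * 1024 * 1024;

    double largest_group = rows * mcv_frequency * tuple_width;
    bool   hash_is_risky = largest_group > work_mem;

    printf("largest duplicate group ~ %.0f MB, risky = %d\n",
           largest_group / (1024 * 1024), hash_is_risky);
    return 0;
}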

Switching to some other algorithm during execution moves the goal posts
to the next galaxy, I'm afraid.

>I also wondered about reducing the buffer size of the BufFiles, but
>that doesn't seem to be fixing the real problem.
>

Yeah. It might help a bit, but the benefit is limited - even if you
reduce the buffer to, say, 1kB, that's just a factor of 8. And I'm not
sure what the impact on performance would be.
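
For illustration, with rough numbers - say 64k batches and the default
8kB BufFile buffer:

64k batches * 8kB = 512MB of buffers
64k batches * 1kB =  64MB of buffers

so even the smaller buffer leaves us more than an order of magnitude
above a 4MB work_mem.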

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
