Re: accounting for memory used for BufFile during hash joins

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: accounting for memory used for BufFile during hash joins
Date: 2019-05-07 01:48:40
Message-ID: CA+hUKG+ppih9DtZwGJh4KMZXdtD=c9+fYoS1CSeBQBs9JzZtbA@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 7, 2019 at 9:58 AM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
> On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> The second patch tries to enforce work_mem more strictly. That would be
>> impossible if we were to keep all the BufFile structs in memory, so
>> instead it slices the batches into chunks that fit into work_mem, and
>> then uses a single "overflow" file for slices currently not in memory.
>> These extra slices can't be counted towards work_mem, but we should need
>> only very few of them. For example, with work_mem=4MB the slice is 128
>> batches, so we need 128x fewer overflow files (compared to per-batch).
>>
> I want to see if I understand the implications of the per-slice-overflow patch
> for execution of hashjoin:
> For each bucket in the hashtable, when attempting to double the number of
> batches, if the memory that the BufFile structs will occupy once this is done
> will exceed the work_mem, split each batch into slices that fit into memory.
> This means that, for each probe-side tuple hashing to that bucket, you have to
> load every slice of each batch separately into memory to ensure correct results.
> Is this right?

Seems expensive for large numbers of slices -- you need to join the
outer batch against each inner slice. But I wonder how we'd deal with
outer joins, as Tom Lane asked in another thread:

https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us

>> I'm not entirely sure which of those approaches is the right one. The
>> first one is clearly just a "damage control" for cases where the hash
>> side turned out to be much larger than we expected. With good estimates
>> we probably would not have picked a hash join for those (that is, we
>> should have realized we can't stay within work_mem and prohibited hash join).
>>
>> The second patch however makes hash join viable for some of those cases,
>> and it seems to work pretty well (there are some numbers in the message
>> posted to the pgsql-performance thread). So I kinda like this second one.
>>
> So, my initial reaction after taking a look at the patches is that I prefer the
> first approach--increasing the resize threshold. The second patch, the
> per-slice-overflow patch, adds a major new mechanism to hashjoin in order to
> address what is, based on my understanding, an edge case.

Personally I'd like to make work_mem more reliable, even if it takes a
major new mechanism.

Stepping back a bit, I think there is something fishy about the way we
detect extreme skew. Is that a factor in this case? Right now we
wait until we have a batch that gets split into child batches
containing exactly 0% and 100% of the tuples before we give up.
Previously I had thought of that as merely a waste of time, but
clearly it's also a waste of unmetered memory. Oops.
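
To spell out what I mean by "exactly 0% and 100%", here's a tiny
standalone sketch of the current give-up rule -- not the real
nodeHash.c code, just the shape of the test, with the variable names
loosely borrowed from it:

#include <stdbool.h>
#include <stdio.h>

/*
 * Decide whether to stop doubling nbatch after splitting a batch.
 * ninmemory = tuples the batch held before the split, nfreed = tuples
 * that moved to the new batch.  Today we only give up when the split
 * moved exactly 0% or exactly 100% of the tuples.
 */
static bool
give_up_on_growth(long nfreed, long ninmemory)
{
    return nfreed == 0 || nfreed == ninmemory;
}

int
main(void)
{
    /* 999999 of 1000000 tuples stayed put: not "extreme" enough yet */
    printf("%d\n", give_up_on_growth(1, 1000000));      /* prints 0 */
    /* all of them stayed put: now we finally stop growing nbatch */
    printf("%d\n", give_up_on_growth(0, 1000000));      /* prints 1 */
    return 0;
}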

I think our extreme skew detector should go off sooner, because
otherwise if you have N nicely distributed unique keys and also M
duplicates of one bad egg key that'll never fit in memory, we keep
repartitioning until none of the N keys falls into the batch containing
the M duplicates, and only then give up! You can use
balls-into-bins maths to figure out the number, but I think that means
we expect to keep splitting until we have N * some_constant batches,
and that's just silly and liable to create massive numbers of
partitions proportional to N, even though we're trying to solve a
problem with M. In another thread I suggested we should stop when
(say) 95% of the tuples go to one child batch. I'm not sure how you
pick the number.
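
You can see the effect with a few lines of throwaway code.  This is
only a simulation -- my own toy hash function, not PostgreSQL's, and
"batch number = low-order hash bits", which is the spirit of the real
thing if not the letter.  It asks how many batches we need before the
bad key's batch contains none of the N well-behaved keys, which is the
point at which today's 0%/100% detector finally fires:

#include <stdio.h>
#include <stdint.h>

/* splitmix64-style mixer, just to get well distributed hash values */
static uint64_t
hash64(uint64_t x)
{
    x += UINT64_C(0x9E3779B97F4A7C15);
    x = (x ^ (x >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    x = (x ^ (x >> 27)) * UINT64_C(0x94D049BB133111EB);
    return x ^ (x >> 31);
}

int
main(void)
{
    const int   trials = 20;
    const long  N = 1000000;        /* nicely distributed unique keys */
    double      sum_nbatch = 0.0;

    for (int t = 0; t < trials; t++)
    {
        uint64_t    bad = hash64(UINT64_C(0xBADC0FFEE) + t);
        int         bits = 0;       /* batch bits needed to isolate the bad key */

        for (long i = 0; i < N; i++)
        {
            uint64_t    h = hash64(((uint64_t) (t + 1) << 32) | (uint64_t) i);
            uint64_t    diff = h ^ bad;
            int         shared = 0;

            /* count low-order batch bits this key shares with the bad key */
            while (shared < 62 && (diff & 1) == 0)
            {
                shared++;
                diff >>= 1;
            }
            if (shared + 1 > bits)
                bits = shared + 1;
        }
        sum_nbatch += (double) ((uint64_t) 1 << bits);
    }
    printf("N = %ld, average nbatch before the bad key is alone: %.0f\n",
           N, sum_nbatch / trials);
    return 0;
}

The answer comes out at some constant multiple of N -- millions of
batches to cope with one key.  Back of the envelope, a 95% rule would
instead fire once the well-behaved tuples left in that batch are
outnumbered roughly ten to one by the M duplicates, i.e. at nbatch of
order N/M rather than of order N.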

Of course that doesn't solve the problem that we don't have a better
plan for dealing with the M duplicates -- it just avoids needless
batch explosions triggered by bad maths. I think we need something
like Tomas's #2, or a way to switch to sort-merge, or some other
scheme. I'm not sure how to compare the slice idea, which involves
processing outer tuples * inner slices, with the sort-merge idea, which
involves sorting the inner and outer batch, plus the entirely new
concept of switching to another node at execution time.
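
To show the kind of comparison I have in mind, here's a toy cost
model.  Every number in it is a made-up assumption (a 1GB inner batch,
a matching outer batch, 128MB slices, and a flat three passes over the
data for an external sort), so it only illustrates the shape of the
trade-off, not a real answer:

#include <stdio.h>

int
main(void)
{
    const double MB = 1024.0 * 1024.0;
    double  inner_bytes = 1024 * MB;    /* oversized inner batch (assumed) */
    double  outer_bytes = 1024 * MB;    /* matching outer batch (assumed) */
    double  slice_bytes = 128 * MB;     /* how much inner fits in memory */
    double  sort_passes = 3.0;          /* crude external sort cost (assumed) */

    double  nslices = inner_bytes / slice_bytes;

    /* slice idea: load each inner slice once, rescan the outer batch per slice */
    double  slice_io = inner_bytes + nslices * outer_bytes;

    /* sort-merge idea: sort both sides, then merge */
    double  sortmerge_io = sort_passes * (inner_bytes + outer_bytes);

    printf("slices: %.0f, slice approach ~%.0f MB of I/O, sort-merge ~%.0f MB of I/O\n",
           nslices, slice_io / MB, sortmerge_io / MB);
    return 0;
}

The slice idea's extra I/O grows with the number of slices, while the
sort-merge idea pays a roughly fixed multiple of the data size; where
the crossover sits depends on numbers we'd have to measure, and it
says nothing about the cost of switching nodes at execution time.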

I also wondered about reducing the buffer size of the BufFiles, but
that doesn't seem to fix the real problem.
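
For what it's worth, the arithmetic there is simple.  Assuming -- and
this is an assumption, from memory -- roughly one fixed-size buffer
per BufFile and two BufFiles (inner and outer) per batch:

#include <stdio.h>

int
main(void)
{
    int     buffer_sizes[] = {8192, 1024};      /* bytes per BufFile buffer (assumed) */
    int     nbatches[] = {256, 4096, 65536};

    for (int b = 0; b < 2; b++)
        for (int n = 0; n < 3; n++)
            printf("buffer %5d bytes, nbatch %6d: ~%.0f kB of buffers\n",
                   buffer_sizes[b], nbatches[n],
                   2.0 * nbatches[n] * buffer_sizes[b] / 1024.0);
    return 0;
}

Shrinking the buffer only changes the constant; the total is still
linear in nbatch, and that linear growth is the real problem.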

--
Thomas Munro
https://enterprisedb.com
