Re: accounting for memory used for BufFile during hash joins

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Hubert Zhang <hzhang(at)pivotal(dot)io>, hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: accounting for memory used for BufFile during hash joins
Date: 2019-07-10 23:51:02
Message-ID: CAAKRu_Yiam-=06L+R8FR+Vaceb-ozQzzMqRiY2pDYku1VdZ=Ew@mail.gmail.com
Lists: pgsql-hackers

Okay, so, while I do have specific, actual code review/commitfest-y
feedback for the patch in this thread registered for this commitfest,
I wanted to defer that to a later email and use this one to cover a
few higher-level issues:

1) How this patch's approach fits into the wider set of problems with
hybrid hashjoin.

2) Parallel HashJoin implementation of this patch's approach

I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.

I do think that accounting for BufFile overhead when estimating the
size of the hashtable during ExecChooseHashTableSize(), so that it
can be used during planning, is a worthwhile patch by itself (though
I know it is not even part of this patch).
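
To make that concrete, here is roughly where such accounting could
go. This is only a sketch; the variable names approximate those in
ExecChooseHashTableSize(), and an actual patch may look quite
different:

    /*
     * Sketch: charge the per-batch BufFile buffers against the memory
     * budget at plan time.  Each batch needs an inner and an outer
     * BufFile, and each open BufFile keeps a BLCKSZ-sized buffer.
     * work_mem is in kB.
     */
    double      buffile_bytes = nbatch * 2.0 * BLCKSZ;

    if (hash_table_bytes + buffile_bytes > work_mem * 1024.0)
    {
        /*
         * The estimate blows the budget once BufFile overhead is
         * counted; the planner could cost this plan higher or
         * choose a different strategy here.
         */
    }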

I'll start with 2 since I have less to say there.

From comments upthread, I take it this would not work with parallel
hashjoin as expected. Is this because each worker operates on batches
independently and now batches are lumped into slices?

Thinking through a parallel-aware implementation, it seems like you
would use slice-based barriers for the build phase but batch-based
barriers for the probe phase to avoid getting out of sync (workers
with outer tuples from one batch should not try and join those with
tuples from another batch, even if in the same slice).

You would, of course, need to add code to make slices work with
SharedTuplestore -- the caveat here is that I still haven't tried to
understand how parallel-aware hashjoin works with/uses
SharedTuplestore.
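
To sketch the phase structure I am imagining (purely hypothetical:
this uses the existing Barrier API from barrier.h, but the
slice/batch barrier arrays and helper functions are invented, and I
have not checked this against how parallel hash actually works):

    for (int slice = 0; slice < nslices; slice++)
    {
        /* all workers cooperate to build from this slice's file */
        build_slice(slice);                             /* invented */
        BarrierArriveAndWait(&slice_barrier[slice], 0);

        /* ...but probe in lockstep one batch at a time */
        for (int batch = slice_first_batch(slice);
             batch <= slice_last_batch(slice);
             batch++)
        {
            probe_batch(batch);                         /* invented */
            BarrierArriveAndWait(&batch_barrier[batch], 0);
        }
    }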

Now, addressing 1, how this patch fits into the wider set of problems
with current hybrid hashjoin:

Thomas Munro nicely summarized roughly what I'm about to lay out like
this (upthread) -- he called them "three separate but related
problems":

> A. Broken escape valve: sometimes we generate a huge number of
> batches while trying to split up many duplicates, because of the
> presence of other more uniformly distributed keys. We could fix that
> with (say) a 95% rule.
> B. Lack of good alternative execution strategy when the escape valve
> is triggered. A batch cannot be split effectively, but cannot fit in
> work_mem, so for now we decide to ignore work_mem.
> C. Unmetered explosion of batches and thus BufFiles, probably usually
> caused by problem A, but theoretically also due to a real need for
> partitions.

However, I would like to lay out the problem space a little bit
differently (using the end of the alphabet to differentiate).

The following scenarios are how you could end up running out of
memory:

Y. Plan-time underestimation of the number of required batches with
relatively uniform data distribution

In this case, the best join execution strategy is a plain hashjoin
with spilling as needed.
nbatch should be increased as needed, because the data is ~evenly
distributed, and slicing should be employed when BufFile overhead
exceeds some threshold ratio of work_mem.
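
For concreteness, the kind of check I have in mind (the ratio and
the function name are invented):

    /*
     * Hypothetical check at nbatch-increase time: each batch needs
     * two BufFiles (inner + outer), each with a BLCKSZ buffer;
     * work_mem is in kB.
     */
    #define BUFFILE_WORK_MEM_RATIO  0.25        /* invented */

    if (nbatch * 2.0 * BLCKSZ >
        work_mem * 1024.0 * BUFFILE_WORK_MEM_RATIO)
        switch_to_slices(hashtable);            /* invented */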

Z. Plan- and/or execution-time underestimation of the number of
required batches with skewed data

If you knew this at planning time, you could have picked another
join type, though there might be cases where it would actually be
less costly to use plain hashjoin for all batches except the bad
batch and fall back to hash block nested loop join just for the
duplicate values.

If you could not have known this at planning time, the best join
execution strategy is a hybrid hashjoin/hash block nested loop join.

To do this, preview whether increasing nbatch would move tuples and,
if it would, increase it (also employing slicing if BufFile overhead
exceeds the threshold).

If increasing nbatch wouldn't move tuples, process this batch with
hash block nested loop join.
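
The per-tuple test for such a preview is cheap. Given the batch
assignment in ExecHashGetBucketAndBatch(), where batchno is taken
from the hash bits above the bucket bits, doubling nbatch moves a
tuple iff the next-higher hash bit is set. A sketch (the walk over
the hashtable's tuples is elided):

    /*
     * batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1), so the
     * bit worth 'nbatch' is the one added to batchno on doubling.
     */
    static inline bool
    tuple_would_move(uint32 hashvalue, int log2_nbuckets, int nbatch)
    {
        return ((hashvalue >> log2_nbuckets) & nbatch) != 0;
    }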

Essentially, what we want is logical units of tuples which are
work_mem-sized. In some cases, each unit may contain multiple batches
(a slice in Tomas' patch) and in other cases, each unit may contain
only part of a batch (a chunk is the term I used in my hash block
nested loop join patch [1]).

For slicing, each unit, a slice, has multiple batches but one spill
file.
For hbnlj, each unit, a chunk, is one of multiple chunks in a single
batch, all of which are in the same spill file (1 batch = 1 spill
file).
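
In code terms, the bookkeeping difference is roughly this (all names
invented):

    /* slicing: several batches share one spill file */
    int         sliceno = batchno / batches_per_slice;
    BufFile    *slice_file = slice_files[sliceno];

    /* hbnlj: one spill file per batch, read one chunk at a time */
    BufFile    *batch_file = batch_files[batchno];
    int         nchunks = (batch_nbytes + work_mem_bytes - 1) /
                          work_mem_bytes;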

Thinking through it, it seems to make the most sense to split the work
into ~ 3 independent pieces:

patch1 - "preview" a batch increase (not yet written [I think])
patch2 - slicing (Tomas' patch [2], plus a threshold for the portion
of work_mem that BufFile overhead may use)
patch3 - hash block nested loop join (my patch [1])

patch1 allows us to re-enable growth and was mentioned upthread, but I
will quote it here for simplicity:

> I think we can fix A by relaxing the escape valve condition, and then
> rechecking it once in a while. So we fill work_mem, realize it didn't
> actually reduce the batch size significantly and disable nbatch growth.
> But at the same time we increase the threshold to 2x work_mem, and after
> reaching it we "consider" a nbatch increase. That is, we walk the batch
> and see how many tuples would move if we increased nbatch (that should be
> fairly cheap) - if it helps, great, enable growth and split the batch. If
> not, double the threshold again. Rinse and repeat.
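
As a sketch of that loop (growEnabled, spaceUsed, spaceAllowed, and
ExecHashIncreaseNumBatches() exist in nodeHash.c today;
preview_split_moves_tuples() is invented):

    if (hashtable->spaceUsed > hashtable->spaceAllowed)
    {
        if (preview_split_moves_tuples(hashtable))  /* invented */
        {
            hashtable->growEnabled = true;
            ExecHashIncreaseNumBatches(hashtable);
        }
        else
            hashtable->spaceAllowed *= 2;   /* rinse and repeat */
    }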

We don't want to fill up work_mem with BufFile overhead after
increasing nbatch many times just to move a few tuples for one batch,
and then end up disabling growth, making it so that later we can't
increase nbatch and repartition a batch that would partition nicely
(like Hubert's case, I believe [3]).

We want to identify when re-partitioning would help and do it only
then; for the times when it wouldn't help, we should use a fallback
strategy that still allows progress on the hashjoin. And, for some
spiky data, where we have re-partitioned for the right reasons but
many of the resulting batches are small enough that they could all
fit in memory at once, we want to track them with as little overhead
as possible -- lump them into slices.

We should probably consider deciding to use slices based on some
threshold for the portion of work_mem which is allowed to be occupied
by BufFile overhead, instead of waiting until the BufFile overhead is
literally taking up most of work_mem.

The above summary is to address the concern in this thread about a
holistic solution.

I think the slicing patch is independent of both the hash block nested
loop join patch and the "preview" mode for batch increasing.

If slicing is made to work for parallel-aware hashjoin and the code is
in a committable state (and probably has the threshold I mentioned
above), then I think that this patch should go in.

[1]
https://www.postgresql.org/message-id/CAAKRu_ZkkukQgXCK8ADe-PmvcmpZh6G1Uin8pqqovL4x7P30mQ%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/20190508150844.rij36rtuk4lhvztw%40development
[3]
https://www.postgresql.org/message-id/CAB0yre%3De8ysPyoUvZqjKYAxc6-VB%3DJKHL-7XKZSxy0FT5vY7BQ%40mail.gmail.com

--
Melanie Plageman
