|From:||Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>|
|Subject:||accounting for memory used for BufFile during hash joins|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
I'm starting this thread mostly to keep track of patches developed in
response to issue  reported on pgsql-performance. The symptoms are
very simple - query performing a hash join ends up using much more
memory than expected (pretty much ignoring work_mem), and possibly
ending up with OOM.
The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB as it includes PGAlignedBuffer.
This is not ideal even if we happen to estimate everything correctly,
because for example with work_mem=4MB and nbatch=1024, it means we'll
use about 16MB (2*8kB*1024) for the BufFile structures alone, plus the
work_mem for hash table itself.
But it can easily explode when we under-estimate the hash side. In the
pgsql-performance message, the hash side (with the patches applied,
allowing the query to complete) it looks like this:
Hash (cost=2823846.37..2823846.37 rows=34619 width=930)
(actual time=252946.367..252946.367 rows=113478127 loops=1)
So it's 3277x under-estimated. It starts with 16 batches, and ends up
adding more and more batches until it fails with 524288 of them (it gets
to that many batches because some of the values are very common and we
don't disable the growth earlier).
The OOM is not very surprising, because with 524288 batches it'd need
about 8GB of memory, and the system only has 8GB RAM installed.
The two attached patches both account for the BufFile memory, but then
use very different strategies when the work_mem limit is reached.
The first patch realizes it's impossible to keep adding batches without
breaking the work_mem limit, because at some point the BufFile will need
more memory than that. But it does not make sense to stop adding batches
entirely, because then the hash table could grow indefinitely.
So the patch abandons the idea of enforcing work_mem in this situation,
and instead attempts to minimize memory usage over time - it increases
the spaceAllowed in a way that ensures doubling the number of batches
actually reduces memory usage in the long run.
The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x less overflow files (compared to per-batch).
Neither of those patches tweaks ExecChooseHashTableSize() to consider
memory needed for BufFiles while deciding how many batches will be
needed. That's something that probably needs to happen, but it would not
help with the underestimate issue.
I'm not entirely sure which of those approaches is the right one. The
first one is clearly just a "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't keep work_mem and prohibit hash join).
The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to pgsql-performance thread). So I kinda like this second one.
It's all just PoC quality, at this point, far from committable state.
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
|Next Message||Andres Freund||2019-05-04 00:45:16||Re: improving wraparound behavior|
|Previous Message||David Rowley||2019-05-03 23:06:46||Re: using index or check in ALTER TABLE SET NOT NULL|