Re: Out of Memory errors are frustrating as heck!

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Gunther <raj(at)gusw(dot)net>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-performance(at)lists(dot)postgresql(dot)org, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Out of Memory errors are frustrating as heck!
Date: 2019-04-28 14:19:01
Message-ID: 20190428141901.5dsbge2ka3rxmpk6@development
Lists: pgsql-performance

On Wed, Apr 24, 2019 at 02:36:33AM +0200, Tomas Vondra wrote:
>
> ...
>
>I still think the idea with an "overflow batch" is worth considering,
>because it'd allow us to keep the memory usage within work_mem. And
>after getting familiar with the hash join code again (haven't messed
>with it since 9.5 or so) I think it should not be all that difficult.
>I'll give it a try over the weekend if I get bored for a while.
>

OK, so I took a stab at this, and overall it seems to be workable. The
patches I have are nowhere near committable, but I think the approach
works fairly well - the memory is kept in check, and the performance is
comparable to the "balancing" approach tested before.

To explain it a bit, the idea is that we can compute how many BufFile
structures we can keep in memory - we can't use more than work_mem/2 for
that, because using more would mostly eliminate space for the actual
data. For example with 4MB we know we can keep 128 batches in memory -
we need 128 files for the outer and 128 for the inner side, so 256 in
total, and 256 * 8kB = 2MB.
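
Just to illustrate the arithmetic (a sketch with made-up names, not
code from the patches):

    #define BUFFILE_BUFFER_SIZE 8192    /* 8kB buffer per BufFile */

    /* Sketch only: how many batches can keep their BufFiles in
     * memory, reserving half of work_mem for the file buffers. */
    static int
    max_inmemory_batches(size_t work_mem_bytes)
    {
        size_t  nfiles = (work_mem_bytes / 2) / BUFFILE_BUFFER_SIZE;

        return (int) (nfiles / 2);  /* inner + outer file per batch */
    }

    /* work_mem = 4MB: 2MB / 8kB = 256 files, i.e. 128 batches */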

And then, when the number of batches needs to increase, instead of
adding more BufFile entries we split the batches into slices that fit
into memory (say, 128 batches each). We keep BufFiles only for the
current slice, plus an "overflow file" for all the other slices. After
processing a slice we simply switch to the next one, and use its
overflow file as a temp file - we redistribute the contents into the
batches of that slice and another overflow file.
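
In (invented) pseudo-code, routing a tuple then looks about like this:

    /* Sketch only - a single overflow file catches everything
     * that belongs to a future slice. */
    int     slice = batchno / nbatch_inmemory;

    if (slice == curslice)
        BufFileWrite(batchfile[batchno % nbatch_inmemory], tuple, len);
    else
        BufFileWrite(overflowfile, tuple, len);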

That's what the v3 patch (named 'single overflow file') does. It does
work, but unfortunately it significantly inflates the amount of data
written to temporary files. Assume we need e.g. 1024 batches, but only
128 fit into memory. That means we'll need 8 slices, and during the
first pass we'll handle 1/8 of the data and write 7/8 to the overflow
file. Then after processing the slice and switching to the next one, we
repeat this dance - 1/8 gets processed, 6/8 written to another overflow
file. So essentially we "forward" about

7/8 + 6/8 + 5/8 + ... + 1/8 = 28/8 = 3.5

of data between slices, and we need to re-shuffle the data in each
slice, which amounts to an additional 1x. That's pretty significant
overhead,
as will be clear from the measurements I'll present shortly.
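
More generally, with S slices the forwarded data works out to (S-1)/2
of the input, i.e. it grows linearly with the slice count. A trivial
sanity check of that sum:

    /* Forwarding overhead with nslices slices:
     * (S-1)/S + (S-2)/S + ... + 1/S = (S-1)/2.
     * nslices = 8 gives 3.5, matching the example above. */
    static double
    forwarded_fraction(int nslices)
    {
        double  total = 0.0;

        for (int k = 1; k < nslices; k++)
            total += (double) k / nslices;

        return total;           /* equals (nslices - 1) / 2.0 */
    }

That linear growth is why the amplification for the single-file
approach blows up with scale in the results below.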

But luckily, there's a simple solution to this - instead of writing the
data into a single overflow file, we can create one overflow file for
each slice. That will leave us with the ~1x of additional writes when
distributing data into batches in the current slice, but it eliminates
the main source of write amplification - avalanche-like forwarding of
data between slices.
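
Compared to the earlier sketch, the only change is the write target -
one overflow file per future slice (again, invented names):

    int     slice = batchno / nbatch_inmemory;

    if (slice == curslice)
        BufFileWrite(batchfile[batchno % nbatch_inmemory], tuple, len);
    else
        BufFileWrite(overflowfile[slice], tuple, len);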

This relaxes the memory limit a bit again, because we can't really keep
the number of overflow files constrained by work_mem, but we should only
need a few of them (far fewer than when adding one file per batch right
away). For example with 128 in-memory batches, this reduces the memory
needed for the overflow files 128x.

And this is what v4 (per-slice overflow file) does, pretty much.

Two more comments, regarding memory accounting in previous patches. It
was a bit broken, because we actually need 2x the number of BufFiles. We
needed nbatch files for the outer side and nbatch files for the inner
side, but we only considered one of those - both when deciding when to
increase the number of batches / increase spaceAllowed, and when
reporting the memory usage. So with a large number of batches the
reported amount of
used memory was roughly 1/2 of the actual value :-/

The memory accounting was a bit bogus for another reason - spaceUsed
simply tracks the amount of memory used by the hash table contents. But
at the end we were adding the current space for the BufFile stuff,
ignoring the fact that it's likely much larger now than when the
spacePeak value got stored. For example, spacePeak might have been
recorded early, when spaceUsed was almost work_mem, and then the final
(much larger) BufFile allocation got added on top of it.
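
In other words, the fix boils down to something like this (a sketch,
not the actual patch code):

    /* Account for BufFiles on both sides (2x), and bump the peak
     * whenever the total grows - not just once at the end. */
    size_t  file_mem = 2 * (size_t) nbatch * BUFFILE_BUFFER_SIZE;

    if (spaceUsed + file_mem > spacePeak)
        spacePeak = spaceUsed + file_mem;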

I've fixed both issues in the patches attached to this message. It does
not make a huge difference in practice, but it makes it easier to
compare values between patches.

Now, some test results - I've repeated the simple test with a uniform
data set, which is pretty much ideal for hash joins (no unexpectedly large
batches that can't be split, etc.). I've done this with 1M, 5M, 10M, 25M
and 50M rows in the large table (which gets picked for the "hash" side),
and measured how much memory gets used, how many batches, how long it
takes and how much data gets written to temp files.

See the hashjoin-test.sh script for more details.

So, here are the results with work_mem = 4MB (so the number of in-memory
batches for the last two entries is 128). The columns are:

* nbatch - the final number of batches
* memory - memory usage in kB, as reported by explain analyze
* time - duration of the query (without explain analyze) in seconds
* size - size of the large table
* temp - amount of data written to temp files
* amplif - write amplification (temp / size)

1M rows
=====================================================================
                nbatch   memory   time  size (MB)  temp (MB)   amplif
---------------------------------------------------------------------
master             256     7681    3.3        730        899     1.23
rebalance          256     7711    3.3        730        884     1.21
single file       1024     4161    7.2        730       3168     4.34
per-slice file    1024     4161    4.7        730       1653     2.26

5M rows
=====================================================================
                nbatch   memory   time  size (MB)  temp (MB)   amplif
---------------------------------------------------------------------
master            2048    36353     22       3652       5276     1.44
rebalance          512    16515     18       3652       4169     1.14
single file       4096     4353    156       3652      53897    14.76
per-slice file    4096     4353     28       3652       8106     2.21

10M rows
=====================================================================
                nbatch   memory   time  size (MB)  temp (MB)   amplif
---------------------------------------------------------------------
master            4096    69121     61       7303      10556     1.45
rebalance          512    24326     46       7303       7405     1.01
single file       8192     4636    762       7303     211234    28.92
per-slice file    8192     4636     65       7303      16278     2.23

25M rows
=====================================================================
                nbatch   memory   time  size (MB)  temp (MB)   amplif
---------------------------------------------------------------------
master            8192   134657    190       7303      24279     1.33
rebalance         1024    36611    158       7303      20024     1.10
single file      16384     6011   4054       7303    1046174    57.32
per-slice file   16384     6011    207       7303      39073     2.14

50M rows
=====================================================================
                nbatch   memory   time  size (MB)  temp (MB)   amplif
---------------------------------------------------------------------
master           16384   265729    531      36500      48519     1.33
rebalance         2048    53241    447      36500      48077     1.32
single file          -        -      -      36500          -        -
per-slice file   32768     8125    451      36500      78662     2.16

From those numbers it's pretty clear that the per-slice overflow file
does by far the best job of enforcing work_mem and minimizing the amount
of data spilled to temp files. It does write a bit more data than both
master and the simple rebalancing, but that's the cost of enforcing
work_mem more strictly. It's generally a bit slower than those two
approaches, although at the largest scale it's actually a bit faster
than master. I think that's pretty acceptable, considering this is meant
to address extreme underestimates where we currently just eat memory.

The case with a single overflow file performs rather poorly - I haven't
even collected data for the largest scale, but considering it spilled
~1TB of temp files at half that size, not much is lost by skipping it.
(Note that this does not mean it needs 1TB of temp space, those writes
are spread over time and the files are created/closed as we go. The
system only has ~100GB of free disk space.)

Gunther, could you try the v2 and v4 patches on your data set? That
would be an interesting data point, I think.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment                        Content-Type      Size
v2-simple-rebalance.patch         text/plain        5.5 KB
v3-single-overflow-file.patch     text/plain        25.1 KB
v4-per-slice-overflow-file.patch  text/plain        26.7 KB
hashjoin-test.sh                  application/x-sh  2.0 KB
