One of the patches in Peter Geoghegan's Parallel tuplesort patch set 
is to put a cap on the number of tapes to use for the sort. The reason
is that each tape consumes 3 * BLCKSZ worth of memory, for the buffers.
We decide the max number of tapes upfront, so if we decide that e.g. the
max number of tapes is 1000, we reserve 1000 * 3 * BLCKSZ bytes of
memory for the buffers, and that memory cannot then be used for the
quicksorts to build the initial runs, even if it turns out later that we
needed only a few tapes.
That amounts to about 8% of the available memory. That's quite wasteful.
Peter's approach of putting an arbitrary cap on the max number of tapes
works, but I find it a bit hackish. And you still waste that 8% with
smaller work_mem settings.
When we finish writing an initial run to a tape, we keep the tape
buffers around. That's the reason we need to reserve that memory. But
there's no fundamental reason we have to do that. As I suggested in ,
we could flush the buffers to disk, and only keep in memory the location
of the last block we wrote. If we need to write another run to the tape,
we can reload the last incomplete block from the disk, and continue writing.
Reloading the last block, requires an extra I/O. That's OK. It's quite
rare to have a multi-pass sort these days, so it's a good bet that you
don't need to do it. And if you have a very small work_mem, so that you
need to do a multi-pass sort, having some more memory available for
building the initial runs is probably worth it, and there's also a good
chance that the block is still in the OS cache.
So, here are two patches to that end:
1. The first patch changes the way we store the logical tapes on disk.
Instead of using indirect blocks, HFS style, the blocks form a linked
list. Each block has a trailer, with the block numbers of the previous
and next block of the tape. That eliminates the indirect blocks, which
simplifies the code quite a bit, and makes it simpler to implement the
pause/resume functionality in the second patch. It also immediately
reduces the memory needed for the buffers, from 3 to 1 block per tape,
as we don't need to hold the indirect blocks in memory.
2. The second patch adds a new function LogicalTapePause(). A paused
tape can be be written to, but it doesn't have an in-memory buffer. The
last, incomplete block is written to disk, but if you call
LogicalTapeWrite(), a new buffer is allocated and the last block is read
back into memory. If you don't write to the tape anymore, and call
LogicalTapeRewind() after LogicalTapePause(), nothing needs to be done,
as the last block is already on disk.
In addition to saving a little bit of memory, I'd like to do this
refactoring because it simplifies the code. It's code that has stayed
largely unchanged for the past 15 years, so I'm not too eager to touch
it, but looking at the changes coming with Peter's parallel tuplesort
patch set, I think this makes sense. The parallel tuplesort patch set
adds code for "serializing" and "deserializing" a tape, which means
writing the top indirect block to disk, so that it can be read back in
in the leader process. That's awfully similar to the "pause/resume"
functionality here, so by adding it now, we won't have to do it in the
parallel tuplesort patch. (Admittedly, it's not a whole lot of code, but
pgsql-hackers by date
|Next:||From: Simon Riggs||Date: 2016-10-04 11:09:05|
|Subject: Re: Logical tape pause/resume|
|Previous:||From: Francisco Olarte||Date: 2016-10-04 09:35:33|
|Subject: Re: Question / requests.|