Skip site navigation (1) Skip section navigation (2)

Logical tape pause/resume

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject: Logical tape pause/resume
Date: 2016-10-04 10:47:30
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-hackers
One of the patches in Peter Geoghegan's Parallel tuplesort patch set [1] 
is to put a cap on the number of tapes to use for the sort. The reason 
is that each tape consumes 3 * BLCKSZ worth of memory, for the buffers. 
We decide the max number of tapes upfront, so if we decide that e.g. the 
max number of tapes is 1000, we reserve 1000 * 3 * BLCKSZ bytes of 
memory for the buffers, and that memory cannot then be used for the 
quicksorts to build the initial runs, even if it turns out later that we 
needed only a few tapes.

That amounts to about 8% of the available memory. That's quite wasteful. 
Peter's approach of putting an arbitrary cap on the max number of tapes 
works, but I find it a bit hackish. And you still waste that 8% with 
smaller work_mem settings.

When we finish writing an initial run to a tape, we keep the tape 
buffers around. That's the reason we need to reserve that memory. But 
there's no fundamental reason we have to do that. As I suggested in [2], 
we could flush the buffers to disk, and only keep in memory the location 
of the last block we wrote. If we need to write another run to the tape, 
we can reload the last incomplete block from the disk, and continue writing.

Reloading the last block, requires an extra I/O. That's OK. It's quite 
rare to have a multi-pass sort these days, so it's a good bet that you 
don't need to do it. And if you have a very small work_mem, so that you 
need to do a multi-pass sort, having some more memory available for 
building the initial runs is probably worth it, and there's also a good 
chance that the block is still in the OS cache.

So, here are two patches to that end:

1. The first patch changes the way we store the logical tapes on disk. 
Instead of using indirect blocks, HFS style, the blocks form a linked 
list. Each block has a trailer, with the block numbers of the previous 
and next block of the tape. That eliminates the indirect blocks, which 
simplifies the code quite a bit, and makes it simpler to implement the 
pause/resume functionality in the second patch. It also immediately 
reduces the memory needed for the buffers, from 3 to 1 block per tape, 
as we don't need to hold the indirect blocks in memory.

2. The second patch adds a new function LogicalTapePause(). A paused 
tape can be be written to, but it doesn't have an in-memory buffer. The 
last, incomplete block is written to disk, but if you call 
LogicalTapeWrite(), a new buffer is allocated and the last block is read 
back into memory. If you don't write to the tape anymore, and call 
LogicalTapeRewind() after LogicalTapePause(), nothing needs to be done, 
as the last block is already on disk.

In addition to saving a little bit of memory, I'd like to do this 
refactoring because it simplifies the code. It's code that has stayed 
largely unchanged for the past 15 years, so I'm not too eager to touch 
it, but looking at the changes coming with Peter's parallel tuplesort 
patch set, I think this makes sense. The parallel tuplesort patch set 
adds code for "serializing" and "deserializing" a tape, which means 
writing the top indirect block to disk, so that it can be read back in 
in the leader process. That's awfully similar to the "pause/resume" 
functionality here, so by adding it now, we won't have to do it in the 
parallel tuplesort patch. (Admittedly, it's not a whole lot of code, but 

[1] CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ(at)mail(dot)gmail(dot)com
[2] e8f44b63-4745-b855-7772-e8201906a4a1(at)iki(dot)fi

- Heikki

Attachment: 0001-Simplify-tape-block-format.patch
Description: text/x-patch (28.7 KB)
Attachment: 0002-Pause-resume-support.patch
Description: text/x-patch (7.3 KB)


pgsql-hackers by date

Next:From: Simon RiggsDate: 2016-10-04 11:09:05
Subject: Re: Logical tape pause/resume
Previous:From: Francisco OlarteDate: 2016-10-04 09:35:33
Subject: Re: Question / requests.

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group