Re: Logical tape pause/resume

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)heroku(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject: Re: Logical tape pause/resume
Date: 2016-10-04 11:47:29
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/04/2016 02:09 PM, Simon Riggs wrote:
> On 4 October 2016 at 11:47, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> When we finish writing an initial run to a tape, we keep the tape buffers
>> around. That's the reason we need to reserve that memory. But there's no
>> fundamental reason we have to do that. As I suggested in [2], we could flush
>> the buffers to disk, and only keep in memory the location of the last block
>> we wrote. If we need to write another run to the tape, we can reload the
>> last incomplete block from the disk, and continue writing.
>> Reloading the last block, requires an extra I/O. That's OK. It's quite rare
>> to have a multi-pass sort these days, so it's a good bet that you don't need
>> to do it. And if you have a very small work_mem, so that you need to do a
>> multi-pass sort, having some more memory available for building the initial
>> runs is probably worth it, and there's also a good chance that the block is
>> still in the OS cache.
> Sounds like a good idea.
> Why not just make each new run start at a block boundary?
> That way we waste on average BLCKSZ/2 disk space per run, which is
> negligible but we avoid any need to have code to read back in the last
> block.

Hmm. You'd still have to read back the last block, so that you can
update its next-pointer.

What you could do, though, is to always store only one run on each tape.
If we can make the in-memory representation of a tape small enough, that
might be acceptable. You only need 4 bytes to store the starting block
number. Or we could dump the starting block numbers of the tapes, to yet
another tape. The number of tapes you merge in one pass would still be
limited by the amount of memory you have, but the number of tapes would
be unlimited.

I don't want to do that right now, it gets more invasive. Something to
keep in mind for the future, though. This also related to your other
suggestion below:

> That also makes it easier to put each run in its own file, which will
> surely help with parallelism and compression since we can spread
> multiple run files across multiple temp tablespaces.
> Can we also please read in the value from the last tuple in a run, so
> we have min and max values in memory for each run? That can be used
> during the merge step to avoid merging runs unless the value ranges
> overlap.

This gets more advanced... Yeah, stuff like that would make sense. You
could also split the tapes into smaller parts, and store the min/max for
each part, to make the merging even more efficient and easier to
parallelize. Or to take that to the extreme, you could make each tape a
little B-tree.

But that's for another day and another patch :-). What I have now is a
drop-in replacement for the current logical tape implementation. These
more advanced schemes would build on top of that.

- Heikki

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Khandekar 2016-10-04 11:53:50 Re: asynchronous execution
Previous Message Julian Markwort 2016-10-04 11:42:26 Re: [PATCH] pgpassfile connection option