Re: Tuplesort merge pre-reading

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Tuplesort merge pre-reading
Date: 2016-09-14 17:43:36
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Addressed all your comments one way or another, new patch attached.
Comments on some specific points below:

On 09/12/2016 01:13 AM, Peter Geoghegan wrote:
> Other things I noticed:
> * You should probably point out that typically, access to batch memory
> will still be sequential, despite your block-based scheme. The
> preloading will now more or less make that the normal case. Any
> fragmentation will now be essentially in memory, not on disk, which is
> a big win.

That's not true, the "buffers" in batch memory are not accessed
sequentially. When we pull the next tuple from a tape, we store it in
the next free buffer. Usually, that buffer was used to hold the previous
tuple that was returned from gettuple(), and was just added to the free

It's still quite cache-friendly, though, because we only need a small
number of slots (one for each tape).

> * i think you should move "bool *mergeactive; /* active input run
> source? */" within Tuplesortstate to be next to the other batch memory
> stuff. No point in having separate merge and batch "sections" there
> anymore.

Hmm. I think I prefer to keep the memory management stuff in a separate
section. While it's true that it's currently only used during merging,
it's not hard to imagine using it when building the initial runs, for
example. Except for replacement selection, the pattern for building the
runs is: add a bunch of tuples, sort, flush them out. It would be
straightforward to use one large chunk of memory to hold all the tuples.
I'm not going to do that now, but I think keeping the memory management
stuff separate from merge-related state makes sense.

> * I think that you need to comment on why state->tuplecontext is not
> used for batch memory now. It is still useful, for multiple merge
> passes, but the situation has notably changed for it.

Hmm. We don't actually use it after the initial runs at all anymore. I
added a call to destroy it in mergeruns().

Now that we use the batch memory buffers for allocations < 1 kB (I
pulled that number out of a hat, BTW), and we only need one allocation
per tape (plus one), there's not much risk of fragmentation.

> On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> * Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:
>>> @@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
>>> Assert(lt->frozen);
>>> datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
>>> }
>>> +
>>> + /* Allocate a read buffer */
>>> + if (lt->buffer)
>>> + pfree(lt->buffer);
>>> + lt->buffer = palloc(lt->read_buffer_size);
>>> + lt->buffer_size = lt->read_buffer_size;
> Of course, when you do that you're going to have to make the new
> "buffer_size" and "read_buffer_size" fields of type "Size" (or,
> possibly, "int64", to match tuplesort.c's own buffer sizing variables
> ever since Noah added MaxAllocSizeHuge). Ditto for the existing "pos"
> and "nbytes" fields next to "buffer_size" within the struct
> LogicalTape, I think. ISTM that logtape.c blocknums can remain of type
> "long", though, since that reflects an existing hardly-relevant
> limitation that you're not making any worse.

True. I fixed that by putting a MaxAllocSize cap on the buffer size
instead. I doubt that doing > 1 GB of read-ahead of a single tape will
do any good.

I wonder if we should actually have a smaller cap there. Even 1 GB seems
excessive. Might be better to start the merging sooner, rather than wait
for the read of 1 GB to complete. The point of the OS readahead is that
the OS will do that for us, in the background. And other processes might
have better use for the memory anyway.

> * It couldn't hurt to make this code paranoid about LACKMEM() becoming
> true, which will cause havoc (we saw this recently in 9.6; a patch of
> mine to fix that just went in):
>> + /*
>> + * Use all the spare memory we have available for read buffers. Divide it
>> + * memory evenly among all the tapes.
>> + */
>> + avail_blocks = state->availMem / BLCKSZ;
>> + per_tape = avail_blocks / maxTapes;
>> + cutoff = avail_blocks % maxTapes;
>> + if (per_tape == 0)
>> + {
>> + per_tape = 1;
>> + cutoff = 0;
>> + }
>> + for (tapenum = 0; tapenum < maxTapes; tapenum++)
>> + {
>> + LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
>> + (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
>> + }
> In other words, we really don't want availMem to become < 0, since
> it's int64, but a derived value is passed to
> LogicalTapeAssignReadBufferSize() as an argument of type "Size". Now,
> if LACKMEM() did happen it would be a bug anyway, but I recommend
> defensive code also be added. Per grow_memtuples(), "We need to be
> sure that we do not cause LACKMEM to become true, else the space
> management algorithm will go nuts". Let's be sure that we get that
> right, since, as we saw recently, especially since grow_memtuples()
> will not actually have the chance to save us here (if there is a bug
> along these lines, let's at least make the right "can't happen error"
> complaint to user when it pops up).

Hmm. We don't really need the availMem accounting at all, after we have
started merging. There is nothing we can do to free memory if we run
out, and we use fairly little memory anyway. But yes, better safe than
sorry. I tried to clarify the comments on that.

> * It looks like your patch makes us less eager about freeing per-tape
> batch memory, now held as preload buffer space within logtape.c.
> ISTM that there should be some way to have the "tape exhausted" code
> path within tuplesort_gettuple_common() (as well as the similar spot
> within mergeonerun()) instruct logtape.c that we're done with that
> tape. In other words, when mergeprereadone() (now renamed to
> mergereadnext()) detects the tape is exhausted, it should have
> logtape.c free its huge tape buffer immediately. Think about cases
> where input is presorted, and earlier tapes can be freed quite early
> on. It's worth keeping that around, (you removed the old way that this
> happened, through mergebatchfreetape()).

OK. I solved that by calling LogicalTapeRewind(), when we're done
reading a tape. Rewinding a tape has the side-effect of freeing the
buffer. I was going to put that into mergereadnext(), but it turns out
that it's tricky to figure out if there are any more runs on the same
tape, because we have the "actual" tape number there, but the tp_runs is
indexed by "logical" tape number. So I put the rewind calls at the end
of mergeruns(), and in TSS_FINALMERGE processing, instead. It means that
we don't free the buffers quite as early as we could, but I think this
is good enough.

> On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>>> + for (tapenum = 0; tapenum < maxTapes; tapenum++)
>>> + {
>>> + LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
>>> + (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
>>> + }
> Spotted another issue with this code just now. Shouldn't it be based
> on state->tapeRange? You don't want the destTape to get memory, since
> you don't use batch memory for tapes that are written to (and final
> on-the-fly merges don't use their destTape at all).

logtape.c will only actually allocate the memory when reading.

> (Looks again...)
> Wait, you're using a local variable maxTapes here, which potentially
> differs from state->maxTapes:
>> + /*
>> + * If we had fewer runs than tapes, refund buffers for tapes that were never
>> + * allocated.
>> + */
>> + maxTapes = state->maxTapes;
>> + if (state->currentRun < maxTapes)
>> + {
>> + FREEMEM(state, (maxTapes - state->currentRun) * TAPE_BUFFER_OVERHEAD);
>> + maxTapes = state->currentRun;
>> + }
> I find this confusing, and think it's probably buggy. I don't think
> you should have a local variable called maxTapes that you modify at
> all, since state->maxTapes is supposed to not change once established.

I changed that so that it does actually change state->maxTapes. I
considered having a separate numTapes field, that can be smaller than
maxTapes, but we don't need the original maxTapes value after that point
anymore, so it would've been just pro forma to track them separately. I
hope the comment now explains that better.

> You can't use state->currentRun like that, either, I think, because
> it's the high watermark number of runs (quicksorted runs), not runs
> after any particular merge phase, where we end up with fewer runs as
> they're merged (we must also consider dummy runs to get this) -- we
> want something like activeTapes. cf. the code you removed for the
> beginmerge() finalMergeBatch case. Of course, activeTapes will vary if
> there are multiple merge passes, which suggests all this code really
> has no business being in mergeruns() (it should instead be in
> beginmerge(), or code that beginmerge() reliably calls).

Hmm, yes, using currentRun here is wrong. It needs to be "currentRun +
1", because we need one more tape than there are runs, to hold the output.

Note that I'm not re-allocating the read buffers depending on which
tapes are used in the current merge pass. I'm just dividing up the
memory among all tapes. Now that the pre-reading is done in logtape.c,
when we reach the end of a run on a tape, we will already have data from
the next run on the same tape in the read buffer.

> Immediately afterwards, you do this:
>> + /* Initialize the merge tuple buffer arena. */
>> + state->batchMemoryBegin = palloc((maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
>> + state->batchMemoryEnd = state->batchMemoryBegin + (maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
>> + state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
>> + USEMEM(state, (maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
> The fact that you size the buffer based on "maxTapes + 1" also
> suggests a problem. I think that the code looks like this because it
> must instruct logtape.c that the destTape tape requires some buffer
> (iff there is to be a non-final merge). Is that right? I hope that you
> don't give the typically unused destTape tape a full share of batch
> memory all the time (the same applies to any other
> inactive-at-final-merge tapes).

Ah, no, the "+ 1" comes from the need to hold the tuple that we last
returned to the caller in tuplesort_gettuple, until the next call. See
lastReturnedTuple. I tried to clarify the comments on that.

Thanks for the thorough review! Let me know how this looks now.

- Heikki

Attachment Content-Type Size
0001-Change-the-way-pre-reading-in-external-sort-s-merge-2.patch text/x-patch 70.5 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-09-14 17:57:42 Re: Vacuum: allow usage of more than 1GB of work mem
Previous Message Pavan Deolasee 2016-09-14 17:40:09 Re: Vacuum: allow usage of more than 1GB of work mem