Re: Tuplesort merge pre-reading

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Tuplesort merge pre-reading
Date: 2016-09-11 23:39:54
Message-ID: CAM3SWZQ4mNMUT+=g9ChKqbYHnddcN1atAeda4aJoXCm1sw2SRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> + for (tapenum = 0; tapenum < maxTapes; tapenum++)
>> + {
>> + LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
>> + (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
>> + }

Spotted another issue with this code just now. Shouldn't it be based
on state->tapeRange? You don't want the destTape to get memory, since
you don't use batch memory for tapes that are written to (and final
on-the-fly merges don't use their destTape at all).

(Looks again...)

Wait, you're using a local variable maxTapes here, which potentially
differs from state->maxTapes:

> + /*
> + * If we had fewer runs than tapes, refund buffers for tapes that were never
> + * allocated.
> + */
> + maxTapes = state->maxTapes;
> + if (state->currentRun < maxTapes)
> + {
> + FREEMEM(state, (maxTapes - state->currentRun) * TAPE_BUFFER_OVERHEAD);
> + maxTapes = state->currentRun;
> + }

I find this confusing, and think it's probably buggy. I don't think
you should have a local variable called maxTapes that you modify at
all, since state->maxTapes is supposed to not change once established.
You can't use state->currentRun like that, either, I think, because
it's the high watermark number of runs (quicksorted runs), not runs
after any particular merge phase, where we end up with fewer runs as
they're merged (we must also consider dummy runs to get this) -- we
want something like activeTapes. cf. the code you removed for the
beginmerge() finalMergeBatch case. Of course, activeTapes will vary if
there are multiple merge passes, which suggests all this code really
has no business being in mergeruns() (it should instead be in
beginmerge(), or code that beginmerge() reliably calls).

Immediately afterwards, you do this:

> + /* Initialize the merge tuple buffer arena. */
> + state->batchMemoryBegin = palloc((maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
> + state->batchMemoryEnd = state->batchMemoryBegin + (maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
> + state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
> + USEMEM(state, (maxTapes + 1) * MERGETUPLEBUFFER_SIZE);

The fact that you size the buffer based on "maxTapes + 1" also
suggests a problem. I think that the code looks like this because it
must instruct logtape.c that the destTape tape requires some buffer
(iff there is to be a non-final merge). Is that right? I hope that you
don't give the typically unused destTape tape a full share of batch
memory all the time (the same applies to any other
inactive-at-final-merge tapes).

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Gavin Flower 2016-09-12 00:16:23 Re: Tuplesort merge pre-reading
Previous Message Thomas Munro 2016-09-11 23:20:20 Re: Merge Join with an Index Optimization