Re: Tuplesort merge pre-reading

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Tuplesort merge pre-reading
Date: 2016-09-15 20:51:19
Message-ID: 1541ef84-c2e4-26b6-03a3-19d22589caa2@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 09/15/2016 10:12 PM, Peter Geoghegan wrote:
> On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>> Spotted another issue with this code just now. Shouldn't it be based
>>> on state->tapeRange? You don't want the destTape to get memory, since
>>> you don't use batch memory for tapes that are written to (and final
>>> on-the-fly merges don't use their destTape at all).
>
>>> Wait, you're using a local variable maxTapes here, which potentially
>>> differs from state->maxTapes:
>
>> I changed that so that it does actually change state->maxTapes. I considered
>> having a separate numTapes field, that can be smaller than maxTapes, but we
>> don't need the original maxTapes value after that point anymore, so it
>> would've been just pro forma to track them separately. I hope the comment
>> now explains that better.
>
> I still don't get why you're doing all of this within mergeruns() (the
> beginning of when we start merging -- we merge all quicksorted runs),
> rather than within beginmerge() (the beginning of one particular merge
> pass, of which there are potentially more than one). As runs are
> merged in a non-final merge pass, fewer tapes will remain active for
> the next merge pass. It doesn't do to do all that up-front when we
> have multiple merge passes, which will happen from time to time.

Now that the pre-reading is done in logtape.c, it doesn't stop at a run
boundary. For example, when we read the last 1 MB of the first run on a
tape, and we're using a 10 MB read buffer, we will merrily also read the
first 9 MB from the next run. You cannot un-read that data, even if the
tape is inactive in the next merge pass.

I don't think it makes much difference in practice, because most merge
passes use all, or almost all, of the available tapes. BTW, I think the
polyphase algorithm prefers to do all the merges that don't use all
tapes upfront, so that the last final merge always uses all the tapes.
I'm not 100% sure about that, but that's my understanding of the
algorithm, and that's what I've seen in my testing.

> Correct me if I'm wrong, but I think that you're more skeptical of the
> need for polyphase merge than I am. I at least see no reason to not
> keep it around. I also think it has some value. It doesn't make this
> optimization any harder, really.

We certainly still need multi-pass merges.

BTW, does a 1-way merge make any sense? I was surprised to see this in
the log, even without this patch:

LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;
LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;
LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;
LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;
LOG: finished 3-way merge step: CPU 0.62s/7.23u sec elapsed 8.44 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;
LOG: finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.44 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;
LOG: finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.45 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-09-15 20:59:00 Re: Renaming some binaries
Previous Message Tom Lane 2016-09-15 20:48:59 Re: Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)