Quick Links

Re: Tuplesort merge pre-reading

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Tuplesort merge pre-reading
Date:	2016-09-15 19:12:43
Message-ID:	CAM3SWZQVpNErrnNv-N0YSDpPqPcRAaQ0uth5Td06+FqhUigkdQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Addressed all your comments one way or another, new patch attached. Comments
> on some specific points below:

Cool. My response here is written under time pressure, which is not
ideal. I think it's still useful, though.

>> * You should probably point out that typically, access to batch memory
>> will still be sequential, despite your block-based scheme.

> That's not true, the "buffers" in batch memory are not accessed
> sequentially. When we pull the next tuple from a tape, we store it in the
> next free buffer. Usually, that buffer was used to hold the previous tuple
> that was returned from gettuple(), and was just added to the free list.
>
> It's still quite cache-friendly, though, because we only need a small number
> of slots (one for each tape).

That's kind of what I meant, I think -- it's more or less sequential.
Especially in the common case where there is only one merge pass.

> True. I fixed that by putting a MaxAllocSize cap on the buffer size instead.
> I doubt that doing > 1 GB of read-ahead of a single tape will do any good.

You may well be right about that, but ideally that could be verified.
I think that the tuplesort is entitled to have whatever memory the
user makes available, unless that's almost certainly useless. It
doesn't seem like our job to judge that it's always wrong to use extra
memory with only a small expected benefit. If it's actually a
microscopic expected benefit, or just as often negative to the sort
operation's performance, then I'd say it's okay to cap it at
MaxAllocSize. But it's not obvious to me that this is true; not yet,
anyway.

> Hmm. We don't really need the availMem accounting at all, after we have
> started merging. There is nothing we can do to free memory if we run out,
> and we use fairly little memory anyway. But yes, better safe than sorry. I
> tried to clarify the comments on that.

It is true that we don't really care about the accounting at all. But,
the same applies to the existing grow_memtuples() case at the
beginning of merging. The point is, we *do* care about availMem, this
one last time. We must at least produce a sane (i.e. >= 0) number in
any calculation. (I think you understand this already -- just saying.)

> OK. I solved that by calling LogicalTapeRewind(), when we're done reading a
> tape. Rewinding a tape has the side-effect of freeing the buffer. I was
> going to put that into mergereadnext(), but it turns out that it's tricky to
> figure out if there are any more runs on the same tape, because we have the
> "actual" tape number there, but the tp_runs is indexed by "logical" tape
> number. So I put the rewind calls at the end of mergeruns(), and in
> TSS_FINALMERGE processing, instead. It means that we don't free the buffers
> quite as early as we could, but I think this is good enough.

That seems adequate.

>> Spotted another issue with this code just now. Shouldn't it be based
>> on state->tapeRange? You don't want the destTape to get memory, since
>> you don't use batch memory for tapes that are written to (and final
>> on-the-fly merges don't use their destTape at all).

>> Wait, you're using a local variable maxTapes here, which potentially
>> differs from state->maxTapes:

> I changed that so that it does actually change state->maxTapes. I considered
> having a separate numTapes field, that can be smaller than maxTapes, but we
> don't need the original maxTapes value after that point anymore, so it
> would've been just pro forma to track them separately. I hope the comment
> now explains that better.

I still don't get why you're doing all of this within mergeruns() (the
beginning of when we start merging -- we merge all quicksorted runs),
rather than within beginmerge() (the beginning of one particular merge
pass, of which there are potentially more than one). As runs are
merged in a non-final merge pass, fewer tapes will remain active for
the next merge pass. It doesn't do to do all that up-front when we
have multiple merge passes, which will happen from time to time.

Correct me if I'm wrong, but I think that you're more skeptical of the
need for polyphase merge than I am. I at least see no reason to not
keep it around. I also think it has some value. It doesn't make this
optimization any harder, really.

> Hmm, yes, using currentRun here is wrong. It needs to be "currentRun + 1",
> because we need one more tape than there are runs, to hold the output.

As I said, I think it should be the number of active tapes, as you see
today within beginmerge() + mergebatch(). Why not do it that way? If
you don't get the distinction, see my remarks below on final merges
always using batch memory, even when there are to be multiple merge
passes (no reason to add that restriction here). More generally, I
really don't want to mess with the long standing definition of
maxTapes and things like that, because I see no advantage.

> Ah, no, the "+ 1" comes from the need to hold the tuple that we last
> returned to the caller in tuplesort_gettuple, until the next call. See
> lastReturnedTuple. I tried to clarify the comments on that.

I see. I don't think that you need to do any of that. Just have the
sizing be based on an even share among activeTapes on final merges
(not necessarily final on-the-fly merges, per my 0002-* patch --
you've done that here, it looks like). However, It looks like you're
doing the wrong thing by only having the check at the top of
beginmerge() -- "if (state->Level == 1)". You're not going to use
batch memory for the final merge just because there happened to be
multiple merges, that way. Sure, you shouldn't be using batch memory
for non-final merges (due to their need for an uneven amount of memory
per active tape), which you're not doing, but the mere fact that you
had a non-final merge should not affect the final merge at all. Which
is the kind of thing I'm concerned about when I talk about
beginmerge() being the right place for all that new code (not
mergeruns()). Even if you can make it work, it fits a lot better to
put it in beginmerge() -- that's the existing flow of things, which
should be preserved. I don't think you need to examine "state->Level"
or "state->currentRun" at all.

Did you do it this way to make the "replacement selection best case"
stuff within mergeruns() catch on, so it does the right thing during
its TSS_SORTEDONTAPE processing?

Thanks
--
Peter Geoghegan

In response to

Re: Tuplesort merge pre-reading at 2016-09-14 17:43:36 from Heikki Linnakangas

Responses

Re: Tuplesort merge pre-reading at 2016-09-15 20:51:19 from Heikki Linnakangas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2016-09-15 19:23:58	Re: Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)
Previous Message	Claudio Freire	2016-09-15 18:54:41	Re: Vacuum: allow usage of more than 1GB of work mem