Re: Logical tape pause/resume

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject: Re: Logical tape pause/resume
Date: 2016-10-10 19:31:38
Message-ID: CAM3SWZRmAHbV6NTA-zj_xR_xExWA7Rio1AxJrYY8SjNV3dEQGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Oct 9, 2016 at 11:52 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Regardless of the number of tapes, the memory used for the tape buffers,
> while building the initial runs, is wasted. It's not entirely wasted when
> you have multiple merge passes, because without the buffer you need to read
> the last partial page back into memory, when you start to output the next
> run on it. But even in that case, I believe that memory would be better used
> for the quicksort, to create larger runs.

Okay. Somehow, I was confused on that point.

> Yeah, there might be value in having a cap, because operating the merge heap
> becomes more expensive when it becomes larger. Mainly because of cache
> misses. We should perhaps have a limit on the size of the merge heap, and
> use the same limit for the replacement selection. Although, the heap behaves
> slightly differently during replacement selection: During replacement
> selection, new values added to the heap tend to go to the bottom of the
> heap, but during merge, new values tend to go close to the top. The latter
> pattern incurs fewer cache misses.

I don't think it makes sense to bring replacement selection into it.
I'd vote for completely killing replacement selection at this point.
The merge heap work committed a few weeks back really weakened the
case for it, a case that the 9.6 work left hanging by a thread to
begin with. In general, having a replacement selection heap that is
larger can only be helpful when that means that we have enough
"juggling capacity" to reorder things into one (or relatively few)
long runs. For RS, the optimal size of the heap is good enough to do
any juggling that is possible to make runs as long as possible, but no
larger. The default replacement_sort_tuples setting (150,000) seems
very high to me, considered from that point of view, actually.

Multiple passes just don't seem to be that bad on modern hardware, in
the cases where they still happen. You don't see a big drop when
tuning down work_mem to just below the threshold at which the sort can
complete in one pass. And, multi-pass sorts still won't happen very
often in practice. Now, this work of yours risks slowing down multiple
pass sorts when that does happen, but I think it could still be worth
it.

> But that's orthogonal to the wasting-memory aspect of this. Even if a we
> have a cap of 100, it's good to not spend memory for the tape buffers of
> those 100 tapes, when building the initial runs.

You're right about that.

>> Okay. But, you haven't actually addressed the problem with non-trivial
>> amounts of memory being logically allocated ahead of time for no good
>> reason -- you've only address the constant factor (the overhead
>> per-tape).
>
>
> I thought I did. Can you elaborate?
>
> Are you referring to the fact that the array of LogicalTapes is still
> allocated ahead of time, with size maxTapes? True, it'd be nice to allocate
> the LogicalTape structs only when needed. But that's peanuts, compared to
> the tape buffers.

Is that's all that is left to remove, in terms of wasteful USEMEM()
overhead? Then, yeah, I guess I am just talking about "peanuts", plus
the more significant issue of merge heap CPU cache efficiency.

>> I can definitely see value in refactoring, to make that code less
>> complicated -- I just don't think it's justified by the amount of
>> memory that is wasted on tapes.
>
>
> Ok, good. I think we're in agreement on doing this, then, even if we don't
> agree on the justification :-).

Agreed. :-)

> Yeah. I'm not sure how partitioning and all that would be done here. But I'm
> prettty sure this simpler logtape.c code is easier to work with, for
> partitioning too.

That's my intuition, too. Obviously my thoughts on partitioning are
still very hand-wavy, but I'm pretty sure that partitioning is the
future for the parallelization of sorting in the executor. Pushing
*everything* down is more important than actually having the sort be
as fast as possible there.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Stark 2016-10-10 20:52:35 Re: pg_dump getBlobs query broken for 7.3 servers
Previous Message Tom Lane 2016-10-10 19:17:13 Re: Switch to unnamed POSIX semaphores as our preferred sema code?