Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeremy Harris <jgh(at)wizmail(dot)org>
Subject: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
Date: 2015-08-04 08:40:41
Message-ID: CAM3SWZQ-ZfoYjCfJoA2w6CCR3xgAqY0of=-DXAfW9Lpgn5UA5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 4, 2015 at 1:24 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Yeah, something like that. To paraphrase, if I'm now understanding it
> correctly, Peter's idea is:
>
> When all the tuples have been fed to tuplesort, and it's time to perform the
> sort, quicksort all the tuples currently in the heap, ignoring the run
> numbers, and turn the resulting array into another tape. That tape is
> special: it's actually stored completely in memory. It is merged with the
> "real" tapes when tuples are returned from the tuplesort, just like regular
> tapes in TSS_FINALMERGE.

Yeah. I imagine that we'll want to put memory prefetch hints for the
new case, since I've independently shown that that works well for the
in-memory case, which this can be very close to.

My next patch will also include quicksorting of runs after we give up
on heapification (after there is more than one run and it is
established that we cannot use my "quicksort with spillover"
optimization, so there are two or more "real" runs on tape). Once
there is clearly not going to be one huge run (which can happen due to
everything largely being in order, even when work_mem is small), and
once incrementally spilling does not end in time to do a "quicksort
with spillover", then the replacement selection thing isn't too
valuable. Especially with large memory sizes but memory bandwidth +
latency as a bottleneck, which is the norm these days.

This seems simpler than my earlier idea of reusing half the memtuples
array only, and resorting the entire array each time, to have
something that consistently approximates replacement selection but
with quicksorting + batching, which I discussed before.

I have this working, and it takes about a good chunk of the runtime
off a sort that merges 3 runs on one reasonable case tested where
work_mem was 300MB. It went from about 56.6 seconds with master to
35.8 seconds with this new approach when tested just now (this
approach saves no writing of tuples, so it's not as effective as the
original "quicksort with spillover" patch can be, but covers a
fundamentally different case). I just need to clean up the patch, and
see if I missed any further optimizations, but this feels like the way
forward multi-run wise. I think it's worth giving up on replacement
selection style runs after the first run is produced, because that's
where the benefits are, if anywhere.

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Shigeru Hanada 2015-08-04 08:50:52 Re: postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)
Previous Message Simon Riggs 2015-08-04 08:39:23 Re: FSM versus GIN pending list bloat