Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Jeremy Harris <jgh(at)wizmail(dot)org>
Subject: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
Date: 2015-07-30 07:00:11
Message-ID: 55B9CB7B.7070204@iki.fi
Lists: pgsql-hackers

On 07/30/2015 04:05 AM, Peter Geoghegan wrote:
> Patch -- "quicksort with spillover"
> =========================
>
> With the attached patch, I propose to add an additional, better "one
> run special case" optimization. This new special case "splits" the
> single run into 2 "subruns". One of the runs is comprised of whatever
> tuples were in memory when the caller finished passing tuples to
> tuplesort. To sort that, we use quicksort, which in general has
> various properties that make it much faster than heapsort -- it's a
> cache oblivious algorithm, which is important these days. The other
> "subrun" is whatever tuples were on-tape when tuplesort_performsort()
> was called. This will often be a minority of the total, but certainly
> never much more than half. This is already sorted when
> tuplesort_performsort() is reached. This spillover is already inserted
> at the front of the sorted-on-tape tuples, and so already has
> reasonably good cache characteristics.
>
> With the patch, we perform an on-the-fly merge that is somewhat
> similar to the existing (unaffected) "merge sort" TSS_FINALMERGE case,
> except that one of the runs is in memory, and is potentially much
> larger than the on-tape/disk run (but never much smaller), and is
> quicksorted. The existing "merge sort" case similarly is only used
> when the caller specified !randomAccess.

Hmm. You don't really need to merge the in-memory array into the tape,
as you know that all the tuples in the in-memory array must come after
the tuples already on the tape. You can just return all the tuples from
the tape first, and then all the tuples from the array.

So here's a shorter/different explanation of this optimization: When
it's time to perform the sort, instead of draining the in-memory heap
one tuple at a time to the last tape, you sort the heap with quicksort,
and pretend that the sorted array belongs to the last tape, after all
the other tuples already on that tape.

Some questions/thoughts on that:

Isn't that optimization applicable even when you have multiple runs?
Quicksorting the heap and keeping it as an array in memory is surely
always faster than heapsorting and pushing it to the tape.

I think it'd make sense to structure the code differently, to match the
way I described this optimization above. Instead of adding a new
tuplesort state for this, abstract this in the logical tape code. Add a
function to attach an in-memory "tail" to a tape, and have
LogicalTapeRead() read from the tail after reading the on-disk tape. The
rest of the code wouldn't need to care that sometimes part of the tape
is actually in memory.
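
Something like this, roughly. This is a standalone sketch with a
FILE * standing in for the on-disk part of a tape; tape_attach_tail()
and tape_read() are made-up names, not the existing logtape.c API:

#include <stdio.h>
#include <string.h>

typedef struct Tape
{
    FILE       *fp;         /* on-disk portion of the tape */
    const char *tail;       /* in-memory portion, logically appended */
    size_t      tail_len;
    size_t      tail_pos;
} Tape;

/* Attach an in-memory buffer that logically follows the on-disk data. */
void
tape_attach_tail(Tape *t, const char *buf, size_t len)
{
    t->tail = buf;
    t->tail_len = len;
    t->tail_pos = 0;
}

/*
 * Read up to 'size' bytes.  Callers see one continuous tape: bytes come
 * from the file until it runs out, then from the in-memory tail.
 */
size_t
tape_read(Tape *t, char *dst, size_t size)
{
    size_t      nread = 0;

    if (t->fp != NULL)
    {
        nread = fread(dst, 1, size, t->fp);
        if (nread == size)
            return nread;
    }
    if (t->tail != NULL && t->tail_pos < t->tail_len)
    {
        size_t      avail = t->tail_len - t->tail_pos;
        size_t      n = size - nread < avail ? size - nread : avail;

        memcpy(dst + nread, t->tail + t->tail_pos, n);
        t->tail_pos += n;
        nread += n;
    }
    return nread;
}

All the position bookkeeping stays inside the tape layer, so the
callers above it don't need to change at all.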

It should be pretty easy to support randomAccess too. If you think of
the in-memory heap as a tail of the last tape, you can easily move
backwards from the in-memory heap back to the on-disk tape, too.

> + * Note that there might actually be 2 runs, but only the
> + * contents of one of them went to tape, and so we can
> + * safely "pretend" that there is only 1 run (since we're
> + * about to give up on the idea of the memtuples array being
> + * a heap). This means that if our sort happened to require
> + * random access, the similar "single run" optimization
> + * below (which sets TSS_SORTEDONTAPE) might not be used at
> + * all. This is because dumping all tuples out might have
> + * forced an otherwise equivalent randomAccess case to
> + * acknowledge a second run, which we can avoid.

Is that really true? We don't start a second run until we have to, i.e.
when it's time to dump the first tuple of the second run to tape. So I
don't think the case you describe above, where you have two runs but
only one of them has tuples on disk, can actually happen.

> Performance
> ==========

Impressive!

> Predictability
> ==========

Even more impressive!

> Future work
> =========

As an extra optimization, you could delay quicksorting the in-memory
array until it's time to read the first tuple from it. If the caller
reads only the top-N tuples from the sort for some reason (other than
LIMIT, which we already optimize for), that could avoid a lot of work.
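
Roughly like this (again just an illustrative standalone sketch with
made-up names, not the real code). The quicksort happens the first time
a tuple is requested from the in-memory part, so a caller that stops
while still reading the on-tape portion never pays for it:

#include <stdbool.h>
#include <stdlib.h>

typedef struct LazyRun
{
    int        *tuples;     /* in-memory tuples, possibly still unsorted */
    size_t      ntuples;
    size_t      pos;
    bool        sorted;     /* has the quicksort been done yet? */
} LazyRun;

static int
cmp_int(const void *a, const void *b)
{
    int         la = *(const int *) a;
    int         lb = *(const int *) b;

    return (la > lb) - (la < lb);
}

/* Sort lazily: the cost is only paid if a tuple is actually fetched. */
bool
lazy_run_next(LazyRun *run, int *out)
{
    if (!run->sorted)
    {
        qsort(run->tuples, run->ntuples, sizeof(int), cmp_int);
        run->sorted = true;
    }
    if (run->pos >= run->ntuples)
        return false;
    *out = run->tuples[run->pos++];
    return true;
}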

- Heikki
