Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Using quicksort for every external sort run
Date: 2015-11-25 19:15:24
Message-ID: CAM3SWZTAZEhF+trC9BOLHO8sD8WVXXJqvTP=b5uPK62c7x6yQg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 25, 2015 at 4:10 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
> That's precisely why it's valuable to see a whole series of data
> points rather than just one. Often, seeing the shape of the curve,
> especially any breaks or changes in the behaviour, helps you
> understand the limitations of the model. Perhaps it would be handy to
> find a machine with a very small amount of physical memory, so you
> could run more reasonably sized tests on it. A VM would be fine if you
> could be sure the storage layer isn't caching.

I have access to the Power7 system that Robert and others sometimes
use for this stuff. I'll try to come up with a variety of tests.

> In short, I think you're right in theory and I want to make sure
> you're right in practice. I'm afraid if we just look at a few data
> points we'll miss out on a bug or a factor we didn't anticipate that
> could have been addressed.

I am in favor of being comprehensive.

> Just to double check though. My understanding is that your quicksort
> algorithm is to fill work_mem with tuples, quicksort them, write out a
> run, and repeat. When the input is done, read work_mem/runs worth of
> tuples from each run into memory and run a merge (using a heap?) like
> we do currently. Is that right?

Yes, that's basically what I'm doing.
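
In outline, a toy sketch of the run generation scheme (illustrative
only -- the names and details here are not tuplesort.c's):

    #include <stdio.h>
    #include <stdlib.h>

    #define WORKMEM_TUPLES 4        /* stand-in for work_mem capacity */

    static int
    cmp_int(const void *a, const void *b)
    {
        int     la = *(const int *) a;
        int     lb = *(const int *) b;

        return (la > lb) - (la < lb);
    }

    int
    main(void)
    {
        int     input[] = {9, 3, 7, 1, 8, 2, 6, 5, 4, 0};
        int     ninput = sizeof(input) / sizeof(input[0]);
        int     buf[WORKMEM_TUPLES];
        int     consumed = 0;
        int     runno = 0;

        while (consumed < ninput)
        {
            int     n = 0;

            /* Fill "work_mem" with as many tuples as fit */
            while (n < WORKMEM_TUPLES && consumed < ninput)
                buf[n++] = input[consumed++];

            /* Quicksort the whole buffer at once -- no heap here */
            qsort(buf, n, sizeof(int), cmp_int);

            /* "Write out" the sorted run (stdout stands in for a tape) */
            printf("run %d:", runno++);
            for (int i = 0; i < n; i++)
                printf(" %d", buf[i]);
            printf("\n");
        }

        /* The final merge of the runs then proceeds as it does today */
        return 0;
    }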

There are two extra bits on top of that:

* Without changing how merging actually works, I am clever about
allocating memory for the final on-the-fly merge. Allocation is done
once, in one huge batch. Importantly, I exploit locality by having
every "tuple proper" (e.g. IndexTuple) in contiguous memory, in sorted
(tape) order, per tape. This also greatly reduces palloc() overhead
for the final on-the-fly merge step (see the first sketch after this
list).

* We do something special when we're just over work_mem, to avoid most
I/O -- "quicksort with spillover". This is a nice trick, but it's
certainly far less important than the basic idea of simply always
quicksorting runs, and I could easily not do it. It's why the heap
code was not significantly simplified to cover only the merge case,
though -- quicksort with spillover uses essentially the same
replacement-selection-style heap to incrementally spill just enough
tuples that the rest of the sort can complete internally (see the
second sketch after this list).
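
To give a feel for the batch memory idea, here's a toy sketch (again,
illustrative only -- the patch works in terms of palloc() and
tuplesort's own state, not this):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NTAPES 3

    int
    main(void)
    {
        size_t  pertape = 1024;     /* stand-in for availMem / NTAPES */
        char   *batch;
        char   *tapebase[NTAPES];
        size_t  tapeused[NTAPES] = {0};
        const char *tup = "tuple";

        /* One big allocation up front, not one palloc() per tuple */
        batch = malloc(NTAPES * pertape);
        if (batch == NULL)
            return 1;

        /* Carve the batch into one contiguous region per tape */
        for (int i = 0; i < NTAPES; i++)
            tapebase[i] = batch + (size_t) i * pertape;

        /*
         * Tuples read from a tape during the merge are appended to
         * that tape's region, so consuming a tape's tuples walks
         * memory sequentially -- that's the locality win.
         */
        for (int i = 0; i < NTAPES; i++)
        {
            memcpy(tapebase[i] + tapeused[i], tup, strlen(tup) + 1);
            tapeused[i] += strlen(tup) + 1;
        }

        printf("tape 0's first tuple: %s\n", tapebase[0]);
        free(batch);
        return 0;
    }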
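
And a toy sketch of quicksort with spillover (also illustrative only;
it glosses over the run invariants the real replacement selection heap
maintains, and uses a linear scan where the real code pops a heap):

    #include <stdio.h>
    #include <stdlib.h>

    #define MEM_TUPLES 8

    static int
    cmp_int(const void *a, const void *b)
    {
        int     la = *(const int *) a;
        int     lb = *(const int *) b;

        return (la > lb) - (la < lb);
    }

    int
    main(void)
    {
        int     input[] = {5, 9, 3, 7, 1, 8, 2, 6, 10, 4};
        int     ninput = sizeof(input) / sizeof(input[0]);
        int     mem[MEM_TUPLES];
        int     spill[MEM_TUPLES];
        int     nspill = 0;
        int     next = 0;

        /* Fill "memory" to capacity */
        while (next < MEM_TUPLES)
        {
            mem[next] = input[next];
            next++;
        }

        /*
         * For each leftover input tuple, spill the current in-memory
         * minimum to the "tape" and take the new tuple into memory.
         * Only a little of the input is just over work_mem, so only a
         * little is spilled.
         */
        while (next < ninput)
        {
            int     minpos = 0;

            for (int i = 1; i < MEM_TUPLES; i++)
                if (mem[i] < mem[minpos])
                    minpos = i;
            spill[nspill++] = mem[minpos];
            mem[minpos] = input[next++];
        }

        /* Quicksort what's still in memory -- most of the input */
        qsort(mem, MEM_TUPLES, sizeof(int), cmp_int);

        /* Merge the small spilled run with the in-memory run */
        {
            int     i = 0, j = 0;

            printf("sorted:");
            while (i < nspill || j < MEM_TUPLES)
            {
                if (j >= MEM_TUPLES ||
                    (i < nspill && spill[i] <= mem[j]))
                    printf(" %d", spill[i++]);
                else
                    printf(" %d", mem[j++]);
            }
            printf("\n");
        }
        return 0;
    }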

> Incidentally one of the reasons abandoning the heap to generate runs
> is attractive is that it opens up other sorting algorithms for us.
> Instead of quicksort we might be able to plug in a GPU sort for
> example.

Yes, it's true that we automatically benefit from optimizations for
the internal sort case now. That's already happening with the patch,
actually -- the "onlyKey" optimization (a further specialized
quicksort, used in the single-attribute heap tuple case and the datum
case) is now used automatically. That was where the best 2012 numbers
for SortSupport were seen, so it makes a significant difference. As
you say, something like that could easily happen again.
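
To give a rough feel for why that kind of specialization matters (a
toy comparison, not how tuplesort.c actually generates its
specializations): the generic qsort() goes through a function pointer
on every comparison, while a routine specialized to one key type
compares inline.

    #include <stdio.h>
    #include <stdlib.h>

    /* Specialized: comparisons are inlined, no indirect calls */
    static void
    isort_int(int *a, int n)
    {
        for (int i = 1; i < n; i++)
        {
            int     key = a[i];
            int     j = i - 1;

            while (j >= 0 && a[j] > key)
            {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }

    /* Generic: called through a function pointer per comparison */
    static int
    cmp_int(const void *a, const void *b)
    {
        int     la = *(const int *) a;
        int     lb = *(const int *) b;

        return (la > lb) - (la < lb);
    }

    int
    main(void)
    {
        int     a[] = {3, 1, 2};
        int     b[] = {3, 1, 2};

        isort_int(a, 3);                    /* specialized path */
        qsort(b, 3, sizeof(int), cmp_int);  /* generic path */
        printf("%d %d %d / %d %d %d\n",
               a[0], a[1], a[2], b[0], b[1], b[2]);
        return 0;
    }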

--
Peter Geoghegan
