Just wanted to review a few thoughts and ideas around improving external
sorts, as Jim Nasby recently encouraged me to do.
Current issues/opportunities are these:
a) Memory is always in short supply, so using what we have more
effectively is going to be welcome.
b) Heap sort has a reasonably strong anti-memory effect, meaning that
there is an optimum amount of memory for any sort: beyond that point,
adding memory makes run forming slower. This shows itself as CPU time
increasing during run forming, making this stage of the sort CPU bound.
c) Many sorts are performed prior to aggregation. It might be possible
to aggregate prior to writing to disk, as a way of reducing the overall
I/O cost. Benefit would occur when the total CPU cost was the same no
matter when aggregation occurred; that would not apply in all cases, so
we would need to sense when benefit was possible.
d) Generally reducing the I/O cost of sorting may help the merging
stages of a sort.
The ideas that Greg Stark, Jim Nasby, Heikki and myself have discussed
to date were the following:
1. Sort I/O Compression
2. Aggregation during Sort
3. Memory Pools
4. Dynamic Heap Management
5. Dynamic Run Handling
I've added (5) to the list as well, which hasn't yet been discussed.
1. SORT I/O COMPRESSION
This idea is not dead yet, it just needs a full set of tests to confirm
that there is benefit in all cases. If the benefit doesn't hold in all
cases, we may be able to work out which cases do, so we know when to
apply compression.
2. AGGREGATION DURING SORT
Many sorts are preliminary steps before aggregation. Aggregation during
run forming would potentially reduce size of heap and reduce number of
comparisons. For many types of aggregate this would not theoretically
increase the number of ops, since sum(), avg(), min() and max() have
transition steps that are insensitive to input order. We would probably
need to add another option to Aggregate Functions to indicate that an
aggregate can be calculated in this way, since some aggregates might
rely on the current situation, where they receive all their inputs at
once in sorted order. (Windowed aggregates are unlikely to work this
way.)
3. MEMORY POOLS
Solving a) could be done by sensible management and allocation of
resources. Discussed before, so not rehashed here.
4. DYNAMIC HEAP MANAGEMENT
The size of the active heap required to produce the fewest number of
runs varies as the sort progresses. For example, sorting an already
sorted input needs a trivial heap size.
Larger heap sizes simply avoid forming more runs, which is not
necessarily a bad thing. More runs only become a bad thing when we go
beyond our ability to perform a single final merge (see Dynamic Run
Handling below).
Smaller heap sizes reduce the number of comparisons required, plus
increase the L2+ cache efficiencies. Those two things are the cause of
the anti-memory effect.
Because of b), optimising the size of the heap could potentially be a
good thing. This can make a considerable difference for nearly sorted
data (measurements required...).
When we have M amount of memory available to us, we don't start by using
it all. We start with m memory and only increase up to M if required.
Runs are built with memory set at m. If a tuple arrives that would force
the formation of a new run we assess
i) do we care if another run is formed? Use our knowledge of the likely
amount of data coming our way, compared with the number of runs formed
so far, and see if we really care. If we don't care, allow the new run
to be formed and carry on with a heap size of just m. (see Dynamic Run
Handling below)
ii) if we do care about number of runs, then allow the heap to grow by
increments up to the full size of M. Increments would be at least x2 and
possibly x4. That way we always have work space to rearrange the heap.
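The i)/ii) decision above can be sketched in a few lines (a sketch of
the policy only; the names are mine and the hard part, physically
rearranging the heap, is deliberately left out):

```python
# Grow-on-demand heap limit: start run forming with a small limit m;
# when a tuple would otherwise start a new run, either accept the new
# run or grow the limit, doubling each time, up to the hard ceiling M.
def next_heap_limit(current_limit, hard_limit, care_about_runs):
    """Return the heap limit to use after a would-be run boundary."""
    if not care_about_runs:
        return current_limit            # i) let the new run form, stay small
    # ii) grow by at least x2, never beyond M
    return min(current_limit * 2, hard_limit)
```

Doubling (rather than growing linearly) is what leaves work space to
rearrange the heap after each growth step.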
All of this dances too cleverly around the exact technique and potential
costs of rearranging the heap. That is not to be ignored and is the next
task in evaluating and accepting/dismissing this potential technique.
In combination with memory pooling this technique might also allow
memory to be better distributed to other users.
5. DYNAMIC RUN HANDLING (in Final Merge)
Another way of addressing a) is to simply make better use of memory
itself. Let's look at that in more detail:
Number of runs that can be merged at once is currently fixed, based upon
available memory. This has the underlying assumption that all runs will
be concurrently active during final merging, which may not always be the
case.
If we have random data then almost all runs will overlap with all other
runs, i.e. the min and max values are sufficiently wide that the runs do
all overlap. In many cases, data arrives in somewhat sorted order, e.g.
financial data is fairly regular with some late payers but not many, and
those trail off with a fairly tight decay. In the somewhat sorted case
we find that the actual overlap is less than total, so there are many
later runs that don't overlap the earlier ones. In the best case we
would find that runs 1 and 2 overlap, runs 2 and 3 overlap, then runs 3
and 4, and so on.
This is also the point where I suggest breaking away from Knuth
completely. All of the main algorithms described by Knuth are tape
sorts. A run is written to a particular tape and then stays there until
"moved" to another tape. That means we have to get super-clever about
how runs should be written and formed (see Knuth). If we realise that
the runs aren't fixed to particular tapes, but are all just independent
runs, we can radically rethink sorting. There is no need to implement
Cascade Sort, but we do need to rethink merging from the ground up. (All
of which is a relief, because Knuth et al are definitely smarter than
me, but I've got disks and lots of memory and those guys had tapes.).
If we track the min and max values for each run, when run building is
finished we will be able to build a merging plan that allows us to be
smart about the runs we should bring together. We start with the run
with the lowest min value, as well as all runs that overlap that run.
When that run is exhausted we move to the next lowest and at that point
start merging all runs that overlap that one.
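With (min, max) tracked per run, the memory a merge plan needs is driven
by the peak number of simultaneously open runs, not the total number of
runs. A small sketch of that calculation (illustrative only; the
function name is mine):

```python
# Count the maximum number of runs whose key ranges overlap at any
# point.  If this peak fits in available merge memory, a single final
# merge suffices no matter how many runs were formed.
def peak_concurrent_runs(ranges):
    """ranges: list of (min_key, max_key), one pair per run."""
    events = []
    for lo, hi in ranges:
        events.append((lo, 1))          # run becomes active
        events.append((hi, -1))         # run exhausted
    # at equal keys, open before close: runs touching at a key overlap
    events.sort(key=lambda e: (e[0], -e[1]))
    peak = active = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak
```

For random data every run spans the whole key range and the peak equals
the run count, but for somewhat sorted data like the financial example
above, chained runs such as (1,10), (8,20), (18,30) never need more
than two open at once.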
This then means we may be able to begin final merging with more runs
than the current cut-off. It's possible that we could merge an
effectively unlimited number of runs in the final merge with fixed
memory, since only the runs that overlap need be open at once. If we
*do* need to pre-merge, we can work out which runs should be our best
candidates, based upon how big they are and which other runs they
overlap. (That's much better than being forced to merge tapes 2, 7 and
17 because some bizarre math says so (see Knuth).)
Anyway, claiming to have found a better way than Knuth makes me feel a
little nervous, so some searching questions on this are very welcome.
Interestingly, if we combine this technique with dynamic heap management
we may be able to allow a very large number of efficiently written runs
to form without it causing any merging.
mac_man recently noted the possibility that some runs don't overlap at
all and so can be merged for free. That's true, though it doesn't
change the basic idea here, which is building a merge plan after runs
have been formed, with an eye on minimizing and potentially eliminating
the merge phase.
There's probably some typos or thinkos above, so go easy on me Greg!
They aren't there because I want to skim over anything.
I'm not likely to get a chance to do all of this in the near future, so
documenting it now should help others to carry things forward.