From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Jeremy Harris <jgh(at)wizmail(dot)org>
Subject: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
Date: 2015-07-30 01:05:59
Message-ID: CAM3SWZTzLT5Y=VY320NznAyz2z_em3us6x=7rXMEUma9Z9yN6Q@mail.gmail.com
Lists: pgsql-hackers

The behavior of external sorts that do not require any merge step due
to only having one run (what EXPLAIN ANALYZE output shows as an
"external sort", and not a "merge sort") seems like an area that can
be significantly improved upon. As noted in code comments, this
optimization did not appear in The Art of Computer Programming, Volume
III. It's not an unreasonable idea, but it doesn't work well on modern
machines due to using heapsort, which is known to use the cache
ineffectively. It also often implies significant additional I/O for
little or no benefit. I suspect that all the reports we've heard of
smaller work_mem sizes improving sort performance are actually down to
this one-run optimization *hurting* performance.

The existing optimization for this case is all about avoiding a final
N-way merge step. (This is not to be confused with the other reason why
a sort can end in TSS_SORTEDONTAPE -- because random tape access is
required, an *on-the-fly* TSS_FINALMERGE merge step cannot happen.
Currently, TSS_SORTEDONTAPE is sometimes the "fast" case, and sometimes
the slow case taken only because the caller specified randomAccess. I'm
improving on only the "fast" case [1], where the caller may or may not
have requested randomAccess; I do require that callers specify
!randomAccess to use my improved optimization, though, and I'm not
trying to avoid a merge step.)

The existing optimization just dumps all tuples in the memtuples array
(which is a heap at that point) to tape, from the top of the heap,
writing out one tuple at a time while maintaining the heap invariant
throughout. Then, with the memtuples array emptied, tuples are fetched
from tape/disk for client code, without any merge step occurring
on-the-fly (or at all).

Patch -- "quicksort with spillover"
=========================

With the attached patch, I propose to add an additional, better "one
run special case" optimization. This new special case "splits" the
single run into two "subruns". One subrun comprises whatever tuples
were still in memory when the caller finished passing tuples to
tuplesort; to sort those, we use quicksort, which in general has
various properties that make it much faster than heapsort -- it's a
cache-oblivious algorithm, which matters a lot on modern hardware. The
other "subrun" consists of whatever tuples were already on tape when
tuplesort_performsort() was called. That will often be a minority of
the total, but certainly never much more than half, and it is already
sorted by the time tuplesort_performsort() is reached. This spillover
is already inserted at the front of the sorted-on-tape tuples, and so
already has reasonably good cache characteristics.

With the patch, we perform an on-the-fly merge that is somewhat
similar to the existing (unaffected) "merge sort" TSS_FINALMERGE case,
except that one of the runs is in memory, and is potentially much
larger than the on-tape/disk run (but never much smaller), and is
quicksorted. The existing "merge sort" case similarly is only used
when the caller specified !randomAccess.

For subtle reasons, the new TSS_MEMTAPEMERGE case will happen
significantly more frequently than the existing, comparable
TSS_SORTEDONTAPE case currently happens (this applies to !randomAccess
callers only, of course). See comments added to
tuplesort_performsort(), and the commit message for the details. Note
that the existing, comparable case was relocated to
tuplesort_performsort(), to highlight that it is now a fallback for
the new TSS_MEMTAPEMERGE case (also in tuplesort_performsort()).

Performance
==========

This new approach can be much faster. For example:

select setseed(1);
-- 10 million tuple table with low cardinality integer column to sort:
create unlogged table low_cardinality_int4 as
  select (random() * 1000)::int4 s, 'abcdefghijlmn'::text junk
  from generate_series(1, 10000000);
set work_mem = '225MB';

-- test query:
select count(distinct(s)) from low_cardinality_int4;
count
-------
1001
(1 row)

On my laptop, a patched Postgres takes about 4.2 seconds to execute
this query, while master takes about 16 seconds. With this 225MB
work_mem setting, the patch quicksorts 9,830,398 of the 10 million
tuples for this case. The case was chosen to be sympathetic, but it is
still quite realistic.
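
To verify which path a given sort takes and how many tuples are
quicksorted in memory, trace_sort can be used on a build with
TRACE_SORT defined (the default). A sketch of such a session follows;
the exact log lines emitted vary by version and with the patch applied:

-- Developer option; reports how each sort was performed via LOG messages
set trace_sort = on;
set client_min_messages = log;
explain analyze
select count(distinct(s)) from low_cardinality_int4;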

We should look at a much less sympathetic case, too. Even when the
in-memory "subrun" is about as small as it can be while still having
this new special case optimization occur at all, the patch still ends
up pretty far ahead of the master branch. With work_mem at 100MB,
4,369,064 tuples are quicksorted by the patch when the above query is
executed, which is less than half of the total (10 million). Execution
with the patch now takes about 10.2 seconds. Master is about 14.7
seconds with the same work_mem setting (yes, master gets faster as the
patch gets slower as work_mem is reduced...but they never come close
to meeting). That's still a fairly decent improvement, and it occurs
when we're on the verge of not using the optimization at all.

Most users who really care about performance will at least have
enough memory to benefit from this optimization when building an index
on a large table, because the patch roughly halves the amount of
memory you need to get at least some of the benefit of an internal
sort.
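
As a purely illustrative example (the index name and memory setting
here are made up; index build sorts are governed by
maintenance_work_mem rather than work_mem):

-- Hypothetical index build; maintenance_work_mem decides whether the
-- build's sort can stay in (or mostly in) memory
set maintenance_work_mem = '512MB';
create index low_cardinality_int4_s_idx on low_cardinality_int4 (s);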

Performance regressions
----------------------------------

I have been unable to find any regressions in the performance of
queries with the patch. If you're looking for a query that might have
been regressed, I suggest coming up with a case involving a sort on
pass-by-reference types that are expensive to compare -- for example,
sorting tuples on many low-cardinality text attributes. Only the most extreme
such cases seem likely to be regressed, though, because heapsort also
has bad temporal locality. My strcoll() result caching patch [2] tries
to take advantage of temporal locality rather than spatial locality,
which works well with quicksort and mergesort.
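
To be concrete, a hypothetical case along those lines (the table
shape, row count, and work_mem setting are invented for illustration)
might look like this:

-- Several low cardinality text sort keys with long shared suffixes,
-- making each comparison relatively expensive
create unlogged table low_card_text as
  select (random() * 5)::int4::text || repeat('x', 100) as a,
         (random() * 5)::int4::text || repeat('y', 100) as b,
         (random() * 5)::int4::text || repeat('z', 100) as c
  from generate_series(1, 2000000);
set work_mem = '256MB';
explain analyze
select * from low_card_text order by a, b, c;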

Memory use
-----------------

The new TSS_MEMTAPEMERGE case uses no more memory than the existing
"TSS_SORTEDONTAPE due to one run" optimization (actually, it uses very
slightly less) if we only worry about the high-water mark. In both
cases the high-water mark comes as the work_mem limit is reached.
Typically, most memory isn't released until the sort is shut down.

Because we're not dumping all tuples here -- only as many as we must
dump to stay within our memory budget -- we're also not freeing the
"tuple proper" behind each SortTuple as it is written out, simply
because most tuples are never written out at all (avoiding that is an
important goal of the patch). Although the memtuples array itself is
often tuplesort's dominant consumer of work_mem, it's still possible
that the aggregate effect of this patch on some workload's memory
consumption is that more memory is used. I doubt it, though; the
overall effect on memory usage will probably always be to reduce it.
Finishing a sort sooner allows us to make memory available for other
operations sooner. Besides, we have broken no promise to the caller
with respect to memory use.

Predictability
==========

The patch alters the performance characteristics of tuplesort, but
very much for the better. With the patch, as less memory is made
available for sorting, performance tends to degrade gradually rather
than abruptly, because we now have what amounts to a hybrid
internal/external sort that, roughly speaking, blends from the former
to the latter based on the availability of memory.

It's particularly useful that performance doesn't fall off a cliff
when we can no longer fit everything in memory because work_mem is
slightly too small. The "almost internal" query from my example above
takes about 4.2 seconds. An equivalent internal sort (quicksort) takes
about 3.5 seconds, which is pretty close (~98% of tuples are
quicksorted for the "almost internal" case, but heapification is a
price we must pay to spill even one tuple). Furthermore, as work_mem
is decreased to the point that even the optimization is no longer used
-- when a traditional "merge sort"/TSS_FINALMERGE is used, instead --
there is also no big drop.
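
One way to see this for yourself from psql (timings will vary, and the
work_mem steps here are arbitrary):

-- Rerun the same test query with progressively less memory; with the
-- patch, the slowdown should be gradual rather than a sudden cliff
\timing on
set work_mem = '225MB';
select count(distinct(s)) from low_cardinality_int4;
set work_mem = '150MB';
select count(distinct(s)) from low_cardinality_int4;
set work_mem = '100MB';
select count(distinct(s)) from low_cardinality_int4;
set work_mem = '50MB';
select count(distinct(s)) from low_cardinality_int4;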

*Predictable* performance characteristics are a big asset. Decreasing
work_mem by 10% (or a moderate increase in the size of a table) should
not make the performance of reporting queries tank. Concerns about
worst case performance (e.g. particular queries suddenly taking much
longer to execute) have certainly prevented me from decreasing
work_mem from within postgresql.conf in the past, even though I was
fairly confident that it made sense for the average case.

Open Items
=========

There are a few smaller open items indicated by "XXX" comments. There
is a little overlap between this patch and a couple of others in the
queue that also affect sorting; for example, in places I account for
cases that don't exist yet.

Future work
=========

In the future, we should think about optimization of the "merge
sort"/TSS_FINALMERGE case, which should be made to sort *every* run
using quicksort (the devil is in the details there, but in general I
think that runs should be quicksorted wherever possible). For now,
what I came up with seems like a relatively simple approach that
offers much of the benefit of that more comprehensive project, since
the need to do a "merge sort" is increasingly rare due to the enormous
main memory sizes that are affordable these days. Of course, Noah's
excellent work on huge repalloc() also contributes to our being able
to put large amounts of memory to good use when sorting.

Since heapification is now a big fraction of the total cost of a sort
sometimes, even where the heap invariant need not be maintained for
any length of time afterwards, it might be worth revisiting the patch
to make that an O(n) rather than an O(n log n) operation [3]. Not sure
about that, but someone should probably look into it. Jeremy Harris is
CC'd here; perhaps he will consider picking that work up once more in
light of this development. It would be nice to get a sort that
quicksorts 99% of all tuples even closer to a conventional internal
sort.

[1] It seems bogus to me that EXPLAIN ANALYZE shows state
TSS_SORTEDONTAPE as an "external sort" rather than a "merge sort",
regardless of whether or not a merge step was actually required (that
is, regardless of whether state ended up TSS_SORTEDONTAPE due to using
the existing "single run" optimization, or because the caller required
randomAccess).
[2] https://commitfest.postgresql.org/6/294/
[3] http://www.postgresql.org/message-id/52F16843.8080001@wizmail.org
--
Peter Geoghegan

Attachment Content-Type Size
0002-Add-cursory-regression-tests-for-sorting.patch text/x-patch 9.6 KB
0001-Quicksort-when-only-one-initial-run-is-produced.patch text/x-patch 21.0 KB
