Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Greg S <stark(at)mit(dot)edu>
Subject: Re: Using quicksort for every external sort run
Date: 2015-12-22 21:37:43
Message-ID: CAM3SWZQuApC3VR55xB48xA-i_5F+HQr77+LVtyeqSz7Ne53DbQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 22, 2015 at 9:10 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> So I was looking at the 0001 patch

Thanks. I'm going to produce a revision of 0002 shortly, so perhaps
hold off on that one. The big change there will be to call
grow_memtuples() to allow us to increase the number of slots without
palloc() overhead spuriously being weighed (since the memory for the
final on-the-fly merge phase doesn't have palloc() overhead). Also,
will incorporate what Jeff and you wanted around terminology.

> This looks like voodoo to me. I assume you tested it and maybe it
> gives correct answers, but it's got to be some kind of world record
> for number of arbitrary constants per SLOC, and there's no real
> justification for any of it. The comments say, essentially, well, we
> do this because it works. But suppose I try it on some new piece of
> hardware and it doesn't work well. What do I do? Email the author
> and ask him to tweak the arbitrary constants?

That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
something, though.

MaxAllocHugeSize is used fairly arbitrarily in pg_stat_statements.c.
And that part (the MaxAllocSize part of my patch) only defines a point
after which we require a really favorable case for replacement
selection/quicksort with spillover to proceed. It's a safety valve. We
try to err on the side of not using replacement selection.

> I wonder if there's a way to accomplish what you're trying to do here
> that avoids the need to have a cost model at all. As I understand it,
> and please correct me wherever I go off the rails, the situation is:
>
> 1. If we're sorting a large amount of data, such that we can't fit it
> all in memory, we will need to produce a number of sorted runs and
> then merge those runs. If we generate each run using a heap with
> replacement selection, rather than quicksort, we will produce runs
> that are, on the average, about twice as long, which means that we
> will have fewer runs to merge at the end.
>
> 2. Replacement selection is slower than quicksort on a per-tuple
> basis. Furthermore, merging more runs isn't necessarily any slower
> than merging fewer runs. Therefore, building runs via replacement
> selection tends to lose even though it tends to reduce the number of
> runs to merge. Even when having a larger number of runs results in an
> increase in the number merge passes, we save so much time building the
> runs that we often (maybe not always) still come out ahead.

I'm with you so far. I'll only add: doing multiple passes ought to be
very rare anyway.

> 3. However, when replacement selection would result in a single run,
> and quicksort results in multiple runs, using quicksort loses. This
> is especially true when we the amount of data we have is between one
> and two times work_mem. If we fit everything into one run, we do not
> need to write any data to tape, but if we overflow by even a single
> tuple, we have to write a lot of data to tape.

No, this is where you lose me. I think that it's basically not true
that replacement selection can ever be faster than quicksort, even in
the cases where the conventional wisdom would have you believe so
(e.g. what you say here). Unless you have very little memory relative
to data size, or something along those lines. The conventional wisdom
obviously needs some revision, but it was perfectly correct in the
1970s and 1980s.

However, where replacement selection can still help is avoiding I/O
*entirely*. If we can avoid spilling 95% of tuples in the first place,
and quicksort the remaining (heapified) tuples that were not spilled,
and merge an in-memory run with an on-tape run, then we can win big.
Quicksort is not amenable to incremental spilling at all. I call this
"quicksort with spillover" (it is a secondary optimization that the
patch adds). This shows up in EXPLAIN ANALYZE, and avoids a stark
discontinuity in the cost function of sorts. That could really help
with admission control, and simplifying the optimizer, making merge
joins less scary. So with the patch, "quicksort with spillover" and
"replacement selection" are almost synonymous, except that we
acknowledge the historic importance of replacement selection to some
degree. The patch completely discards the conventional use of
replacement selection -- it just preserves its priority queue (heap)
implementation where incrementalism is thought to be particularly
useful (avoiding I/O entirely).

But this comparison has nothing to do with comparing the master branch
with my patch, since the master branch never attempts to avoid I/O
having committed to an external sort. It uses replacement selection in
a way that is consistent with the conventional wisdom, wisdom which
has now been shown to be obsolete.

BTW, I think that abandoning incrementalism (replacement selection)
will have future benefits for memory management. I bet we can get away
with one big palloc() for second or subsequent runs that are
quicksorted, greatly reducing palloc() overhead and waste there, too.

> If this is correct so far, then I wonder if we could do this: Forget
> replacement selection. Always build runs by quicksorting. However,
> when dumping the first run to tape, dump it a little at a time rather
> than all at once. If the input ends before we've completely written
> the run, then we've got all of run 1 in memory and run 0 split between
> memory and tape. So we don't need to do any extra I/O; we can do a
> merge between run 1 and the portion of run 0 which is on tape. When
> the tape is exhausted, we only need to finish merging the in-memory
> tails of the two runs.

My first attempt at this -- before I realized that replacement
selection was just not a very good algorithm, due to the upsides not
remotely offsetting the downsides on modern hardware -- was a hybrid
between quicksort and replacement selection.

The problem is that there is too much repeated work. If you spill like
this, you have to quicksort everything again. The replacement
selection queue keeps track of a currentRun and nextRun, to avoid
this, but quicksort can't really do that well.

In general, the replacement selection heap will create a new run that
cannot be spilled (nextRun -- there won't be one initially) if there
is a value less than any of those values already spilled to tape. So
it is built to avoid redundant work in a way that quicksort really
cannot be.

> I also wonder if you've thought about the case where we are asked to
> sort an enormous amount of data that is already in order, or very
> nearly in order (2,1,4,3,6,5,8,7,...). It seems worth including a
> check to see whether the low value of run N+1 is higher than the high
> value of run N, and if so, append it to the existing run rather than
> starting a new one. In some cases this could completely eliminate the
> final merge pass at very low cost, which seems likely to be
> worthwhile.

While I initially shared this intuition -- that replacement selection
could hardly be beaten by a simple hybrid sort-merge strategy for
almost sorted input -- I changed my mind. I simply did not see any
evidence for it. I may have missed something, but it really does not
appear to be worth while. The quicksort fallback to insertion sort
also does well with presorted input. The merge is very cheap (over and
above reading one big run off disk) for presorted input under most
circumstances. A cost model adds a lot of complexity, which I hesitate
to add without clear benefits.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2015-12-22 22:23:47 Re: Additional role attributes && superuser review
Previous Message Peter Geoghegan 2015-12-22 19:23:02 Re: A Typo in regress/sql/privileges.sql