Re: Using quicksort for every external sort run

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Greg S <stark(at)mit(dot)edu>
Subject: Re: Using quicksort for every external sort run
Date: 2015-12-22 17:10:04
Message-ID: CA+TgmoaxNQE97escn3WVLUX1zYO49zZiCJTsBdm04_zq0QeC-Q@mail.gmail.com
Lists: pgsql-hackers

On Sun, Dec 6, 2015 at 7:25 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Tue, Nov 24, 2015 at 4:33 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> So, the bottom line is: This patch seems very good, is unlikely to
>> have any notable downside (no case has been shown to be regressed),
>> but has yet to receive code review. I am working on a new version with
>> the first two commits consolidated, and better comments, but that will
>> have the same code, unless I find bugs or am dissatisfied. It mostly
>> needs thorough code review, and to a lesser extent some more
>> performance testing.
>
> I'm currently spending a lot of time working on parallel CREATE INDEX.
> I should not delay posting a new version of my patch series any
> further, though. I hope to polish up parallel CREATE INDEX to be able
> to show people something in a couple of weeks.
>
> This version features consolidated commits, the removal of the
> multipass_warning parameter, and improved comments and commit
> messages. It has almost entirely unchanged functionality.
>
> The only functional changes are:
>
> * The function useselection() is taught to distrust an obviously bogus
> caller reltuples hint (when it's already less than half of what we
> know to be the minimum number of tuples that the sort must sort,
> immediately after LACKMEM() first becomes true -- this is probably a
> generic estimate).
>
> * Prefetching only occurs when writing tuples. Explicit prefetching
> appears to hurt in some cases, as David Rowley has shown over on the
> dedicated thread. But it might still be that writing tuples is a case
> that is simple enough to benefit consistently, due to the relatively
> uniform processing that memory latency can hide behind for that case
> (before, the same prefetching instructions were used for CREATE INDEX
> and for aggregates, for example).
>
> Maybe we should consider trying to get patch 0002 (the memory
> pool/merge patch) committed first, something Greg Stark suggested
> privately. That might actually be an easier way of integrating this
> work, since it changes nothing about the algorithm we use for merging
> (it only improves memory locality), and so is really an independent
> piece of work (albeit one that makes a huge overall difference due to
> the other patches increasing the time spent merging in absolute terms,
> and especially as a proportion of the total).

So I was looking at the 0001 patch and came across this code:

+ /*
+ * Crossover point is somewhere between where memtuples is between 40%
+ * and all-but-one of total tuples to sort. This weighs approximate
+ * savings in I/O, against generic heap sorting cost.
+ */
+ avgTupleSize = (double) memNowUsed / (double) state->memtupsize;
+
+ /*
+ * Starting from a threshold of 90%, refund 7.5% per 32 byte
+ * average-size-increment.
+ */
+ increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
+ crossover = 0.90 - (increments * 0.075);
+
+ /*
+ * Clamp, making either outcome possible regardless of average size.
+ *
+ * 40% is about the minimum point at which "quicksort with spillover"
+ * can still occur without a logical/physical correlation.
+ */
+ crossover = Max(0.40, Min(crossover, 0.85));
+
+ /*
+ * The point where the overhead of maintaining the heap invariant is
+ * likely to dominate over any saving in I/O is somewhat arbitrarily
+ * assumed to be the point where memtuples' size exceeds MaxAllocSize
+ * (note that overall memory consumption may be far greater). Past
+ * this point, only the most compelling cases use replacement selection
+ * for their first run.
+ *
+ * This is not about cache characteristics so much as the O(n log n)
+ * cost of sorting larger runs dominating over the O(n) cost of
+ * writing/reading tuples.
+ */
+ if (sizeof(SortTuple) * state->memtupcount > MaxAllocSize)
+ crossover = avgTupleSize > 32 ? 0.90 : 0.95;

This looks like voodoo to me. I assume you tested it and maybe it
gives correct answers, but it's got to be some kind of world record
for number of arbitrary constants per SLOC, and there's no real
justification for any of it. The comments say, essentially, well, we
do this because it works. But suppose I try it on some new piece of
hardware and it doesn't work well. What do I do? Email the author
and ask him to tweak the arbitrary constants?
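To make the arithmetic concrete, the quoted hunk boils down to something like this (a standalone restatement, not the actual tuplesort.c code; MAXALIGN is assumed to be 8 bytes, and the names are mine):

```c
#include <assert.h>
#include <math.h>

/*
 * Restatement of the quoted crossover heuristic as a pure function.
 * "exceeds_maxalloc" stands in for the sizeof(SortTuple) * memtupcount
 * > MaxAllocSize test.
 */
static double
crossover_sketch(double avgTupleSize, int exceeds_maxalloc)
{
	/* MAXALIGN_DOWN approximated as rounding down to 8 bytes */
	int			increments = (((int) avgTupleSize) & ~7) / 32;
	double		crossover = 0.90 - increments * 0.075;

	/* clamp to [0.40, 0.85] */
	if (crossover > 0.85)
		crossover = 0.85;
	if (crossover < 0.40)
		crossover = 0.40;

	/* the MaxAllocSize special case overrides everything */
	if (exceeds_maxalloc)
		crossover = avgTupleSize > 32 ? 0.90 : 0.95;

	return crossover;
}
```

Working it through: a 24-byte average tuple yields 0.85 (after clamping), a 64-byte tuple yields 0.75, and anything past roughly 224 bytes pins at the 0.40 floor -- which is exactly the sort of behavior that's impossible to anticipate from reading the constants.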

The dependency on MaxAllocSize seems utterly bizarre to me. If we
decide to modify our TOAST infrastructure so that we support datums up
to 2GB in size, or alternatively datums of up to only 512MB in size,
do you expect that to change the behavior of tuplesort.c? I bet not,
but that's a major reason why MaxAllocSize is defined the way it is.

I wonder if there's a way to accomplish what you're trying to do here
that avoids the need to have a cost model at all. As I understand it,
and please correct me wherever I go off the rails, the situation is:

1. If we're sorting a large amount of data, such that we can't fit it
all in memory, we will need to produce a number of sorted runs and
then merge those runs. If we generate each run using a heap with
replacement selection, rather than quicksort, we will produce runs
that are, on the average, about twice as long, which means that we
will have fewer runs to merge at the end.
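For anyone following along, here is a toy model of how replacement selection assigns tuples to runs (ints only, in-memory "tapes"; the names are mine, not tuplesort.c's). The key move is that an incoming tuple smaller than the last one written out is deferred to the next run:

```c
#include <assert.h>
#include <stdlib.h>

/* One heap entry: the tuple's value and the run it is destined for. */
typedef struct { int run; int val; } Ent;

static int
ent_less(Ent a, Ent b)
{
	return a.run < b.run || (a.run == b.run && a.val < b.val);
}

static void
sift_down(Ent *h, int n, int i)
{
	for (;;)
	{
		int		l = 2 * i + 1, r = l + 1, m = i;

		if (l < n && ent_less(h[l], h[m])) m = l;
		if (r < n && ent_less(h[r], h[m])) m = r;
		if (m == i)
			return;
		Ent		t = h[i]; h[i] = h[m]; h[m] = t;
		i = m;
	}
}

/*
 * Run replacement selection over in[0..n) with a heap of heap_sz entries.
 * Emits (run number, value) pairs to out_runs/out_vals; returns the number
 * of runs produced.
 */
static int
replacement_select(const int *in, int n, int heap_sz,
				   int *out_runs, int *out_vals)
{
	Ent	   *h = malloc(heap_sz * sizeof(Ent));
	int		hn = 0, next = 0, outn = 0, maxrun = 0;

	/* prime the heap: everything starts out destined for run 0 */
	while (next < n && hn < heap_sz)
		h[hn++] = (Ent) {0, in[next++]};
	for (int i = hn / 2 - 1; i >= 0; i--)
		sift_down(h, hn, i);

	while (hn > 0)
	{
		Ent		top = h[0];

		out_runs[outn] = top.run;
		out_vals[outn++] = top.val;
		if (top.run > maxrun)
			maxrun = top.run;
		if (next < n)
		{
			int		v = in[next++];

			/* smaller than what we just wrote: defer to the next run */
			h[0] = (Ent) {v < top.val ? top.run + 1 : top.run, v};
		}
		else
			h[0] = h[--hn];
		sift_down(h, hn, 0);
	}
	free(h);
	return maxrun + 1;
}
```

With random input this produces runs averaging about twice the heap size (Knuth's classic result); with already-sorted input it produces a single run of unbounded length, which is where point 3 below comes from.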

2. Replacement selection is slower than quicksort on a per-tuple
basis. Furthermore, merging more runs isn't necessarily any slower
than merging fewer runs. Therefore, building runs via replacement
selection tends to lose even though it tends to reduce the number of
runs to merge. Even when having a larger number of runs results in an
increase in the number of merge passes, we save so much time building the
runs that we often (maybe not always) still come out ahead.

3. However, when replacement selection would result in a single run,
and quicksort results in multiple runs, using quicksort loses. This
is especially true when the amount of data we have is between one
and two times work_mem. If we fit everything into one run, we do not
need to write any data to tape, but if we overflow by even a single
tuple, we have to write a lot of data to tape.

If this is correct so far, then I wonder if we could do this: Forget
replacement selection. Always build runs by quicksorting. However,
when dumping the first run to tape, dump it a little at a time rather
than all at once. If the input ends before we've completely written
the run, then we've got all of run 1 in memory and run 0 split between
memory and tape. So we don't need to do any extra I/O; we can do a
merge between run 1 and the portion of run 0 which is on tape. When
the tape is exhausted, we only need to finish merging the in-memory
tails of the two runs.
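The final merge in that scheme would look something like this sketch (ints only, with the tape modeled as an array): run 0 is consumed tape-portion first and in-memory tail second, and merged against the fully in-memory run 1.

```c
#include <assert.h>

/* Read position i of run 0, whose prefix sits on "tape". */
static int
run0_peek(const int *tape, int ntape, const int *tail, int i)
{
	return i < ntape ? tape[i] : tail[i - ntape];
}

/*
 * Merge run 1 (entirely in memory) with run 0, whose prefix was already
 * dumped to tape and whose tail is still in memory.  No extra I/O is
 * needed: run 0's tape portion all sorts before its in-memory tail.
 */
static void
merge_spillover(const int *tape, int ntape,
				const int *tail, int ntail,
				const int *run1, int n1, int *out)
{
	int		i = 0, j = 0, k = 0, n0 = ntape + ntail;

	while (i < n0 && j < n1)
	{
		int		v0 = run0_peek(tape, ntape, tail, i);

		if (v0 <= run1[j])
		{
			out[k++] = v0;
			i++;
		}
		else
			out[k++] = run1[j++];
	}
	while (i < n0)
		out[k++] = run0_peek(tape, ntape, tail, i++);
	while (j < n1)
		out[k++] = run1[j++];
}
```

The point being: once run 0 is sorted, dumping its front incrementally doesn't change the merge logic at all, it just splits one input across two sources.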

I also wonder if you've thought about the case where we are asked to
sort an enormous amount of data that is already in order, or very
nearly in order (2,1,4,3,6,5,8,7,...). It seems worth including a
check to see whether the low value of run N+1 is higher than the high
value of run N, and if so, append it to the existing run rather than
starting a new one. In some cases this could completely eliminate the
final merge pass at very low cost, which seems likely to be
worthwhile.
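Sketched out, the check amounts to this (a toy simulation, ints only, quicksorting fixed-size batches in place of work_mem-sized loads):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int
cmp_int(const void *a, const void *b)
{
	int		x = *(const int *) a, y = *(const int *) b;

	return (x > y) - (x < y);
}

/*
 * Count the runs produced when in[0..n) is cut into quicksorted batches
 * of batch_sz tuples, appending a batch to the previous run whenever its
 * low value is >= the previous run's high value.
 */
static int
count_runs(const int *in, int n, int batch_sz)
{
	int		runs = 0, prev_high = 0;
	int	   *buf = malloc(batch_sz * sizeof(int));

	for (int off = 0; off < n; off += batch_sz)
	{
		int		m = (n - off < batch_sz) ? n - off : batch_sz;

		memcpy(buf, in + off, m * sizeof(int));
		qsort(buf, m, sizeof(int), cmp_int);
		if (runs == 0 || buf[0] < prev_high)
			runs++;				/* must start a new run */
		prev_high = buf[m - 1];
	}
	free(buf);
	return runs;
}
```

On the (2,1,4,3,6,5,8,7,...) pattern, an even batch size collapses everything into one run, but an odd batch size straddles the swapped pairs and defeats the check.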

Unfortunately, it's possible to fool this algorithm pretty easily -
suppose the data is as in the parenthetical note in the previous
paragraph, but the number of tuples that fits in work_mem is odd. I
wonder if we can find instances where such cases regress significantly
as compared with the replacement selection approach, which might be
able to produce a single run out of an arbitrary amount of data.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
