Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2016-03-21 06:01:28
Message-ID: CAM3SWZS3ttv3_AjGw_BmRRd3QML3YZ9FueN=bwQ0r6dZVS2mDA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 17, 2016 at 1:13 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> OK, I have now committed 0001

I attach a revision of the external quicksort patch and supplementary
small patches, rebased on top of the master branch.

Changes:

1. As I mentioned on the "Improve memory management for external
sorts" -committers thread, we should protect against currentRun
integer overflow. This new revision does so.

I'm not sure if that change needs to be back-patched; I just don't
want to take any risks, and see this as low cost insurance. Really low
workMem sorts are now actually fast enough that this seems like
something that could happen on a misconfigured system.

2. Add explicit constants for special run numbers that replacement
selection needs to care about in particular.

I did this because change 1 reminded me of the "currentRun vs.
SortTuple.tupindex" run numbering subtleties. The explicit use of
certain run number constants seems to better convey some tricky
details, in part by letting me add a few documenting if obvious
assertions. It's educational to be able to grep for the these
constants (e.g., the new HEAP_RUN_NEXT constant) to jump to the parts
of the code that need to think about replacement selection. As things
were, that code relied on too much from too great a distance (arguably
this is true even in the master branch). This change in turn led to
minor wordsmithing to adjacent areas here and there, most of it
subtractive.

As an example of where this helps, ISTM that the assertion added to
the routine tuplesort_heap_insert() is now self-documenting, which
wasn't the case before.

3. There was one very tricky consideration around an edge-case that
required careful thought. This was an issue within my new function
dumpbatch(). It could previously perform what turns out to be a
superfluous selectnewtape() call when we take the dumpbatch()
"state->memtupcount == 0" early return path (see the last revision for
full details of that now-axed code path). Now, we accept that there
may on occasion be 0 tuple runs. In other words, we now never returned
early from within dumpbatch().

There was previously no explanation for why it was okay to have a
superfluous selectnewtape() call. However, needing to be certain that
any newly selected destTape tape will go on to receive a run is
implied for the general case by this existing selectnewtape() code
comment:

* This is called after finishing a run when we know another run
* must be started. This implements steps D3, D4 of Algorithm D.

While the previous revision was correct anyway, I tried to explain why
it was correct in comments, and soon realized that that was a really
bad idea; the rationale was excessively complicated.

Allowing 0 tuple runs in rare cases seems like the simplest solution.
After all, mergeprereadone() is expressly prepared for 0 tuple runs.
It says "ensure that we have at least one tuple, if any are to be
had". There is no reason to assume that it says this only because it
imagines that no tuples might be found *only after* the first preread
for the merge (by which I mean I don't think that only applies when a
final on-the-fly merge reloads tuples from one particular tape
following running out of tuples of the tape/run in memory).

4. I updated the function beginmerge() to acknowledge an inconsistency
for pass-by-value datumsorts, which I mentioned in passing on this
thread a few days back. The specific change:

@@ -2610,7 +2735,12 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)

if (finalMergeBatch)
{
- /* Free outright buffers for tape never actually allocated */
+ /*
+ * Free outright buffers for tape never actually allocated. The
+ * !state->tuples case is actually entitled to have at least this much
+ * of a refund ahead of its final merge, but we don't go out of our way
+ * to marginally improve that case.
+ */
FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);

It's not worth worrying about this case, since the savings are small
(especially now that maxTapes is capped). But it's worth acknowledging
that the "!state->tuples" case is being "short-changed", in the new
spirit of heavily scrutinizing where memory goes in tuplesort.c.

5. I updated the "Add MemoryContextStats() calls for debugging" patch.
I now formally propose that this debugging instrumentation be
committed.

This revised debugging instrumentation patch does not have the system
report anything about the memory context just because "trace_sort =
on". Rather, it does nothing on ordinary builds, where the macro
SHOW_MEMORY_STATS will not be defined (it also respects trace_sort).
This is about the same approach seen in postgres.c's
finish_xact_command(). ISTM that we ought to provide a way of
debugging memory use within tuplesort.c, since we now know that that
could be very important. Let's not forget where the useful places to
look for problems are.

6. Based on your feedback on the batch memory patch (your commit
c27033ff), I made a stylistic change. I made similar comments about
the newly added quicksort/dumpbatch() MemoryContextReset() call, since
it has its own special considerations (a big change in the pattern of
allocations occurs after batch memory is used -- we need to be careful
about how that could impact the "bucketing by size class").

Thanks
--
Peter Geoghegan

Attachment Content-Type Size
0003-Add-MemoryContextStats-calls-for-debugging.patch text/x-patch 1.4 KB
0002-Perform-memory-prefetching-when-writing-memtuples.patch text/x-patch 7.7 KB
0001-Quicksort-when-performing-external-sorts.patch text/x-patch 39.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Abhijit Menon-Sen 2016-03-21 06:34:40 Re: dealing with extension dependencies that aren't quite 'e'
Previous Message Amit Kapila 2016-03-21 05:56:43 Re: Performance degradation in commit ac1d794