Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2016-03-11 02:54:10
Message-ID: CAM3SWZRFzg1LUK8FBg_goZ8zL0n7k6q83qQjhOV8NDZioA5TEQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 14, 2016 at 8:01 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> The query I'm testing is: "reindex index pgbench_accounts_pkey;"
>
> Now, with a maintenance_work_mem of 5MB, the most recent revision of
> my patch takes about 54.2 seconds to complete this, as compared to
> master's 44.4 seconds. So, clearly a noticeable regression there of
> just under 20%. I did not see a regression with a 5MB
> maintenance_work_mem when pgbench scale was 100, though.

I've fixed this regression, and possibly all regressions where workMem > 4MB.
I've done so without resorting to making the heap structure
more complicated, or using a heap more often than when
replacement_sort_mem is exceeded by work_mem or maintenance_work_mem
(so replacement_sort_mem becomes something a bit different to what we
discussed, Robert -- more on that later). This seems like an
"everybody wins" situation, because in this revision the patch series
is now appreciably *faster* where the amount of memory available is
only a tiny fraction of the total input size.

Jeff Janes deserves a lot of credit for helping me to figure out how
to do this. I couldn't get over his complaint about the regression he
saw a few months back. He spoke of an "anti-sweetspot" in polyphase
merge, and how he suspected that to be the real culprit (after all,
most of his time was spent merging, with or without the patch
applied). He also said that reverting the memory batch/pool patch made
things go a bit faster, somewhat ameliorating his regression (when
just the quicksort patch was applied). This made no sense to me, since
I understood the memory batching patch to be orthogonal to the
quicksort thing, capable of being applied independently, and more or
less a strict improvement on master, no matter what the variables of
the sort are. Jeff's regressed case especially made no sense to me
(and, I gather, to him) given that the regression involved no
correlation, and so clearly wasn't reliant on generating far
fewer/longer runs than the patch (that's the issue we've discussed
more than any other now -- it's a red herring, it seems). As I
suspected out loud on February 14th, replacement selection mostly just
*masked* the real problem: the problem of palloc() fragmentation.
There doesn't seem to be much of an issue with the scheduling of
polyphase merging, once you fix palloc() fragmentation. I've created a
new revision, incorporating this new insight.

New Revision
============

Attached revision of patch series:

1. Creates a separate memory context for tuplesort's copies of
caller's tuples, which can be reset at key points, avoiding
fragmentation. Every SortTuple.tuple is allocated there (with trivial
exception); *everything else*, including the memtuples array, is
allocated in the existing tuplesort context, which becomes the parent
of this new "caller's tuples" context. Roughly speaking, that means
that about half of total memory for the sort is managed by each
context in common cases. Even with a high work_mem memory budget,
memory fragmentation could previously get so bad that tuplesort would
in effect claim a share of memory from the OS that is *significantly*
higher than the work_mem budget allotted to its sort. And with low
work_mem settings, fragmentation previously made palloc() thrash the
sort, especially during non-final merging. In this latest revision,
tuplesort now gets to use almost 100% of the memory that was requested
from the OS by palloc() in the cases tested.

2. Loses the "quicksort with spillover" case entirely, making the
quicksort patch significantly simpler. A *lot* of code was thrown out.

This change is especially significant because it allowed me to remove
the cost model that Robert took issue with so vocally. "Quicksort with
spillover" was always far less important than the basic idea of using
quicksort for external sorts, so I'm not sad to see it go. And, I
thought that the cost model was pretty bad myself.

3. Fixes cost_sort(), making the optimizer account for the fact that
runs are now about sort_mem-sized, not (sort_mem * 2)-sized.

While I was at it, I made cost_sort() more optimistic about the amount
of random I/O required relative to sequential I/O. This additional
change to cost_sort() was probably overdue.

4. Restores the ability of replacement selection to generate one run
and avoid any merging (previously, only one really long run and one
short run was possible, because at the time I conceptualized
replacement selection as being all about enabling "quicksort with
spillover", which quicksorted that second run in memory). This
only-one-run case is the case that Robert particularly cared about,
and it's fully restored when RS is in use (which can still only happen
for the first run, just never for the benefit of the now-axed
"quicksort with spillover" case).

5. Adds a new GUC, "replacement_sort_mem". The default setting is
16MB. Docs are included. If work_mem/maintenance_work_mem is less than
or equal to this, the first (and hopefully only) run uses replacement
selection.

"replacement_sort_mem" is a different thing from the GUC concept that
Robert and I discussed (only the name is the same). That other concept
was related to the hybrid heap/quicksort idea (it controlled how
big the heap portion of memtuples was supposed to be, in a speculative
world where the patch took that "hybrid" approach [1] at all). In
light of this new information about palloc() fragmentation, and given
the risk to tuplesort's stability posed by implementing this "hybrid"
algorithm, this seems like a good compromise. I cannot see an upside
to pursuing the "hybrid" approach now. I regret reversing my position
on that, but that's just how it happened. Since Robert was seemingly
only worried about regressions, which are fixed now for a variety of
cases that I tested, I'm optimistic that this will be acceptable to
him. I believe that replacement_sort_mem as implemented here is quite
useful, although mostly because I see some further opportunities for
it.

Replacement Selection uses
--------------------------

What opportunities, you ask? Maybe CREATE INDEX can be made to accept
a "presorted" parameter, letting the user promise that the input is
more or less presorted. This allows tuplesort to only use a tiny heap,
while having it throw an error if it cannot produce one long run (i.e.
CREATE INDEX is documented as raising an error if the input is not
more or less presorted). The nice thing about that idea is that we can
be very optimistic about the data actually being more or less
presorted, so the implementation doesn't *actually* produce one long
run -- it produces one long *index*, with IndexTuples passed back to
nbtsort.c as soon as the heap fills for the first time, a bit like an
on-the-fly merge. Unlike an on-the-fly merge, no tapes or temp files
are involved; we write out IndexTuples by optimistically writing out
the index itself. There is a significant saving by using a
heap *because there is no need for a TSS_SORTEDONTAPE pass over the
data*. We succeed at doing it all at once with a tiny heap, or die
trying. So, in a later version of Postgres (9.7?),
replacement_sort_mem becomes more important because of this
"presorted" CREATE INDEX parameter. That's a relatively easy patch to
write, but it's not 9.6 material.

Commits
-------

Note that the attached revision makes the batch memory patch the first
commit in the patch series. It might be useful to get that one out of
the way first, since I imagine it is now considered the least
controversial, and is perhaps the simpler of the two big patches in
the series. I'm not very optimistic about the memory prefetch patch
0003-* getting committed, but so far I've only seen it help, and all
of my testing is based on having it applied. In any case, it's clearly
far less important than the other two patches.

Testing
-------

N.B.: The debug patch, 0004-*, should not be applied during benchmarking.

I've used amcheck [2] to test this latest revision -- the tool ought
to not see any problems with any index created with the patch applied.
Reviewers might find it helpful to use amcheck, too. As 9.6 is
stabilized, I anticipate that amcheck will give us a fighting chance
at early detection of any bugs that might have slipped into tuplesort,
or a B-Tree operator class. Since we still don't even have one single
test of the external sort code [3], it's just as well. If we wanted to
test external sorting, maybe we'd do that by adding tests to amcheck
that are not run by default, much like test_decoding, which tests
logical decoding but is not targeted by "make installcheck"; that
would allow the tests to be fairly comprehensive without being
annoying. Using amcheck neatly side-steps issues with the portability
of "expected" pg_regress output when collatable type sorting is
tested.

Thoughts?

[1] http://www.postgresql.org/message-id/CA+TgmoY87y9FuZ=NE7JayH2emtovm9Jp9aLfFWunjF3utq4hfg@mail.gmail.com
[2] https://commitfest.postgresql.org/9/561/
[3] http://pgci.eisentraut.org/jenkins/job/postgresql_master_coverage/Coverage/src/backend/utils/sort/tuplesort.c.gcov.html
--
Peter Geoghegan

Attachment Content-Type Size
0004-Add-MemoryContextStats-calls-for-debugging.patch text/x-patch 1.4 KB
0003-Perform-memory-prefetching-when-writing-memtuples.patch text/x-patch 7.8 KB
0002-Quicksort-when-performing-external-sorts.patch text/x-patch 37.3 KB
0001-Allocate-memory-in-batch-for-tuplesort-merge.patch text/x-patch 37.4 KB
