Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2015-09-06 00:48:10
Message-ID: CAM3SWZRiHaF7jdf923ZZ2qhDJiErqP5uU_+JPuMvUmeD0z9fFA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> Let's start by comparing an external sort that uses 1/3 the memory of
> an internal sort against the master branch. That's completely unfair
> on the patch, of course, but it is a useful indicator of how well
> external sorts do overall. Although an external sort surely cannot be
> as fast as an internal sort, it might be able to approach an internal
> sort's speed when there is plenty of I/O bandwidth. That's a good
> thing to aim for, I think.

> The patch only takes ~10% more time to execute this query, which seems
> very good considering that ~1/3 the work_mem has been put to use.

> Note that the on-tape runs are small relative to CPU costs, so this
> query is a bit sympathetic (consider the time spent writing batches
> that trace_sort indicates here). CREATE INDEX would not compare so
> well with an internal sort, for example, especially if it was a
> composite index or something.

This is something that I've made great progress on (see "concrete
example" below for numbers). The differences in the amount of I/O
required between these two cases (due to per-case variability in the
width of tuples written to tape for datum sorts and index sorts) did
not significantly factor into the differences in performance, it
turns out. The big issue was that while a pass-by-value datum sort
accidentally has good cache characteristics during the merge step,
that is not generally true. I figured out a way of making it generally
true, though. I attach a revised patch series with a new commit that
adds an optimization to the merge step, relieving what was a big
remaining bottleneck in the CREATE INDEX case (and *every* external
sort case that isn't a pass-by-value datum sort, which is most
things). There are also a few tweaks to earlier commits, but nothing
very interesting.

All of my benchmarks suggest that this most recent revision puts
external sorting within a fairly small margin of a fully internal sort
on the master branch in many common cases. This difference is seen
when the implementation only makes use of a fraction of the memory
required for an internal sort, provided the system is reasonably well
balanced. For a single backend, there is an overhead of about 5% - 20%
against master's internal sort performance. That margin appears to be
fairly robust across a variety of different cases.

I particularly care about CREATE INDEX, since that is where most pain
is felt in the real world, and I'm happy that I found a way to make
CREATE INDEX external sort reasonably comparable in run time to
internal sorts that consume much more memory. I think it's time to
stop talking about this as performance work, and start talking about
it as scalability work. With that in mind, I'm mostly going to compare
the performance of the new, optimized external sort implementation
with the existing internal sort implementation from now on.

New patch -- Sequential memory access
===============================

The trick I hit upon for relieving the merge bottleneck was fairly simple.

Prefetching works for internal sorts, but isn't practical for external
sorts while merging. OTOH, I can arrange to have runs allocate their
"tuple proper" contents into a memory pool, partitioned by final
on-the-fly tape number. Today, runs/tapes are slurped from disk
sequentially in a staggered fashion, based on the availability of
in-memory tuples from each tape while merging. The new patch is very
effective in reducing cache misses by simply making sure that each
tape's "tuple proper" (e.g. each IndexTuple) is accessed in memory in
the natural, predictable order (the sorted order that runs on tape
always have). Unlike with internal sorts (where explicit memory
prefetching of each "tuple proper" may be advisable), the final order
in which the caller must consume a tape's "tuple proper" is
predictable well in advance.

A little rearrangement is required to make what were previously retail
palloc() calls during prereading (a palloc() for each "tuple proper",
within each READTUP() routine) consume space from the memory pool
instead. The pool (a big, once-off memory allocation) is reused in a
circular fashion per tape partition. This saves a lot of palloc()
overhead.
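
To make that rearrangement concrete, here is a minimal, self-contained
sketch of the idea (this is not the patch's code; TapePool,
tape_pool_alloc() and tape_pool_reset() are hypothetical names used
only for illustration): one big allocation is partitioned among the
active tapes, and each READTUP()-style caller consumes space from its
tape's partition instead of doing a retail palloc().

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct TapePool
{
    char   *base;       /* start of this tape's slice of the big pool */
    size_t  size;       /* bytes assigned to this tape */
    size_t  used;       /* bytes handed out since the last reset */
} TapePool;

/* Recycle the tape's space before prereading its next batch of tuples. */
static void
tape_pool_reset(TapePool *pool)
{
    pool->used = 0;
}

/*
 * Hand out space for one "tuple proper" from the tape's partition.
 * Returning NULL tells the caller to stop prereading from this tape
 * for now, much as running out of memtuples slots would.
 */
static void *
tape_pool_alloc(TapePool *pool, size_t len)
{
    void   *ptr;

    len = (len + 7) & ~(size_t) 7;      /* keep allocations aligned */
    if (pool->used + len > pool->size)
        return NULL;
    ptr = pool->base + pool->used;
    pool->used += len;
    return ptr;
}

int
main(void)
{
    int         ntapes = 6;
    size_t      per_tape = 64 * 1024;
    char       *big = malloc(ntapes * per_tape);    /* one big allocation */
    TapePool    pools[6];
    char       *tup;

    for (int i = 0; i < ntapes; i++)
    {
        pools[i].base = big + i * per_tape;
        pools[i].size = per_tape;
        pools[i].used = 0;
    }

    /* A READTUP()-style caller consumes pool space instead of palloc(). */
    tup = tape_pool_alloc(&pools[0], 24);
    if (tup != NULL)
        memset(tup, 0, 24);
    printf("tape 0 used %zu of %zu bytes\n", pools[0].used, pools[0].size);

    tape_pool_reset(&pools[0]);
    free(big);
    return 0;
}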

Under this scheme, each tape's next few IndexTuples are all in one
cacheline. This patch has the merge step make better use of available
memory bandwidth, rather than attempting to conceal memory latency.
Explicit prefetch instructions (that we may independently end up using
to do something similar with internal sorts when fetching tuples
following sorting proper) are all about hiding latency.
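
To make the latency-versus-bandwidth distinction concrete, here is a
small standalone sketch (hypothetical code, not taken from the patch
series; the struct and array names are stand-ins, and
__builtin_prefetch() is the GCC/Clang builtin that an explicit
prefetch scheme would ultimately boil down to):

#include <stdio.h>
#include <stdlib.h>

#define NTUPLES 1000

typedef struct
{
    int    *tuple;      /* stand-in for a SortTuple's "tuple proper" pointer */
} SortTupleSketch;

static long
consume(const int *tup)
{
    return *tup;
}

int
main(void)
{
    SortTupleSketch *memtuples = malloc(NTUPLES * sizeof(SortTupleSketch));
    int        *heap = malloc(NTUPLES * sizeof(int));
    long        sum = 0;

    for (int i = 0; i < NTUPLES; i++)
    {
        heap[i] = i;
        /* after an internal sort, tuple pointers point all over the heap */
        memtuples[i].tuple = &heap[(i * 7919) % NTUPLES];
    }

    /*
     * Internal sort fetch: addresses are unpredictable to the hardware,
     * so explicitly prefetching a couple of tuples ahead hides latency.
     */
    for (int i = 0; i < NTUPLES; i++)
    {
        if (i + 2 < NTUPLES)
            __builtin_prefetch(memtuples[i + 2].tuple, 0, 3);
        sum += consume(memtuples[i].tuple);
    }

    /*
     * Merge with the per-tape pool: each tape's tuples sit consecutively
     * in its partition, so a plain sequential walk lets the hardware
     * prefetcher stream them in; no explicit prefetch is needed.
     */
    for (int i = 0; i < NTUPLES; i++)
        sum += consume(&heap[i]);

    printf("%ld\n", sum);
    free(memtuples);
    free(heap);
    return 0;
}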

Concrete example -- performance
---------------------------------------------

I attach a text file describing a practical, reproducible CREATE INDEX
example. It shows how CREATE INDEX now compares fairly well with
an equivalent operation that has enough maintenance_work_mem to
complete its sort internally. I'll just summarize it here:

A CREATE INDEX on a single int4 attribute on an unlogged table takes
only ~18% longer. This is a 100 million row table that is 4977 MB on
disk. On master, CREATE INDEX takes 66.6 seconds in total with an
*internal* sort. With the patch series applied, an *external* sort
involving a final on-the-fly merge of 6 runs takes 78.5 seconds.
Obviously, since there are 6 runs to merge, maintenance_work_mem is
only approximately 1/6 of what is required for a fully internal sort.

High watermark memory usage
------------------------------------------

One concern about the patch may be that it increases the high
watermark memory usage of any on-the-fly final merge step. It takes
full advantage of the availMem allowance at a point where every "tuple
proper" is freed, and availMem has only had SortTuple/memtuples array
"slot" memory subtracted (plus overhead). Memory is allocated in bulk
once, and partitioned among active tapes, with no particular effort
towards limiting memory usage beyond enforcing that we always
!LACKMEM().
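
To sketch just the accounting, in the style of tuplesort.c's
USEMEM()/FREEMEM()/LACKMEM() macros (the variable names and numbers
below are illustrative only, not taken from the patch):

#include <stdio.h>
#include <stdlib.h>

static long availMem;                   /* bytes we are still allowed to use */

#define USEMEM(amt)     (availMem -= (amt))
#define FREEMEM(amt)    (availMem += (amt))
#define LACKMEM()       (availMem < 0)

int
main(void)
{
    long        workMemBytes = 64L * 1024 * 1024;   /* pretend budget */
    long        slotArrayBytes = 1024L * 1024;      /* memtuples array, say */
    int         activeTapes = 6;
    long        poolBytes;
    long        perTape;
    char       *pool;

    availMem = workMemBytes;

    /*
     * At the start of the final on-the-fly merge every "tuple proper" has
     * been freed, so only the memtuples slot array has been subtracted
     * from availMem.
     */
    USEMEM(slotArrayBytes);

    /*
     * Grab everything that remains in one allocation and split it evenly
     * among the active input tapes; no attempt to limit usage beyond
     * keeping !LACKMEM().
     */
    poolBytes = availMem;
    perTape = poolBytes / activeTapes;
    pool = malloc((size_t) poolBytes);
    USEMEM(poolBytes);

    printf("per-tape budget: %ld bytes, LACKMEM: %s\n",
           perTape, LACKMEM() ? "true" : "false");

    free(pool);
    FREEMEM(poolBytes);
    FREEMEM(slotArrayBytes);
    return 0;
}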

A lot of the overhead of many retail palloc() calls is removed by
simply using one big memory allocation. In practice, LACKMEM() will
rarely become true, because the availability of slots now tends to be
the limiting factor. This is partially explained by the number of
slots being established when palloc() overhead was in play, prior to
the final merge step. However, I have concerns about the memory usage
of this new approach.

With the int4 CREATE INDEX case above, which has a uniform
distribution, I noticed that about 40% of each tape's memory space
remains unused when slots are exhausted. Ideally, we'd only have
allocated enough memory to run out at about the same time that slots
are exhausted, since the two would be balanced. This might be possible
for fixed-sized tuples. I have not given each active tape in the final
on-the-fly merge an individually managed pool, because while this
waste of memory is large enough to be annoying, it's not large enough
to be significantly helped by managing a bunch of per-tape buffers and
enlarging them geometrically as needed (e.g. starting small, and
doubling each time the current buffer fills up until the per-tape
limit is finally reached).

The main reason that the high watermark is increased is not because of
this, though. It's mostly just that "tuple proper" memory is not freed
until the sort is done, whereas before there were many small pfree()
calls to match the many palloc() calls -- calls that occurred early
and often. Note that the availability of "slots" (i.e. the size of the
memtuples array, minus one element for each tape's heap item) is
currently determined by whatever size it happened to be at when
memtuples stopped growing, which isn't particularly well principled
(hopefully this is no worse now).

Optimal memory usage
-------------------------------

In the absence of any clear thing to care about most beyond making
sorting faster while still enforcing !LACKMEM(), for now I've kept it
simple. I am saving a lot of memory by clawing back palloc() overhead,
but may be wasting more than that in another way now, to say nothing
of the new high watermark itself. If we're entirely I/O bound, maybe
we should avoid wasting memory by simply not allocating as much in
the first place (i.e. the extra memory may only theoretically help
even when it is
written to). But what does it really mean to be I/O bound? The OS
cache probably consumes plenty of memory, too.

Finally, let us not forget that it's clearly still the case that even
following this work, run size needs to be optimized using a cost
model, rather than simply being determined by how much memory can be
made available (work_mem). If we get a faster sort using far less
work_mem, then the DBA is probably accidentally wasting huge amounts
of memory due to failing to do that. As an implementor, it's really
hard to balance all of these concerns, or to say that one in
particular is most urgent.

Parallel sorting
===========

Simon rightly emphasized the need for joined-up thinking in relation
to applying important tuplesort optimizations. We must at least
consider parallelism as part of this work.

I'm glad that the first consumer of parallel infrastructure is set to
be parallel sequential scans, not internal parallel sorts. That's
because it seems that overall, a significant cost is actually reading
tuples into memtuples to sort -- heap scanning and related costs in
the buffer manager (even assuming everything is in shared_buffers),
COPYTUP() palloc() calls, and so on. Taken together, they can be a
bigger overall cost than sorting proper, even assuming abbreviated
keys are not used. The third bucket that I tend to categorize costs
into, "time spent actually writing out finished runs", is small on a
well balanced system. Surprisingly small, I would say.

I will sketch a simple implementation of parallel sorting based on the
patch series that may be workable, and requires relatively little
implementation effort compared to other ideas that were raised at
various times (a toy code model of the scheme follows the list):

* Establish an optimal run size ahead of time using a cost model. We
need this for serial external sorts anyway, to relieve the DBA of
having to worry about sizing maintenance_work_mem according to obscure
considerations around cache efficiency within tuplesort. Parallelism
probably doesn't add much complexity to the cost model, which is not
especially complicated to begin with. Note that I have not added this
cost model yet (just the ad-hoc, tuplesort-private cost model for
using replacement selection to get a "quicksort with spillover"). It
may be best if this cost model lives in the optimizer.

* Have parallel workers do a parallel heap scan of the relation until
they fill this optimal run size. Use local memory to sort within
workers. Write runs out in the usual way. Then, the worker picks up
the next run scheduled. If there are no more runs to build, there is
no more work for the parallel workers.

* Shut down workers. Do an on-the-fly merge in the parent process.
This is the same as with a serial merge, but with a little
coordination with worker processes to make sure every run is
available, etc. In general, coordination is kept to an absolute
minimum.
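
Purely to make that division of labour concrete, here is a toy,
self-contained model of the scheme, with pthreads standing in for
parallel workers, in-memory arrays standing in for on-disk runs, and a
simple linear scan standing in for tuplesort's heap-based on-the-fly
merge. It illustrates the shape of the approach only, not the actual
infrastructure that would be used:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NWORKERS    4
#define RUN_SIZE    250000      /* the "optimal run size" a cost model picked */

typedef struct RunState
{
    unsigned int seed;          /* stand-in for a chunk of the parallel scan */
    int        *run;            /* this worker's sorted run ("tape") */
    int         nitems;
} RunState;

static unsigned int
next_rand(unsigned int *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fff;
}

static int
cmp_int(const void *a, const void *b)
{
    int         ia = *(const int *) a;
    int         ib = *(const int *) b;

    return (ia > ib) - (ia < ib);
}

/* Worker: claim a run's worth of input, sort it locally, "write" the run. */
static void *
build_run(void *arg)
{
    RunState   *rs = (RunState *) arg;

    for (int i = 0; i < rs->nitems; i++)
        rs->run[i] = (int) next_rand(&rs->seed);
    qsort(rs->run, rs->nitems, sizeof(int), cmp_int);
    return NULL;
}

int
main(void)
{
    pthread_t   workers[NWORKERS];
    RunState    runs[NWORKERS];
    int         pos[NWORKERS] = {0};
    long        nmerged = 0;
    int         last = -1;

    /* Launch workers; each builds a sorted run independently. */
    for (int i = 0; i < NWORKERS; i++)
    {
        runs[i].seed = (unsigned int) (i + 1);
        runs[i].nitems = RUN_SIZE;
        runs[i].run = malloc(RUN_SIZE * sizeof(int));
        pthread_create(&workers[i], NULL, build_run, &runs[i]);
    }

    /* "Shut down workers": wait until every run is available. */
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(workers[i], NULL);

    /* Parent: serial on-the-fly merge across all of the workers' runs. */
    for (;;)
    {
        int         best = -1;

        for (int i = 0; i < NWORKERS; i++)
        {
            if (pos[i] < runs[i].nitems &&
                (best < 0 || runs[i].run[pos[i]] < runs[best].run[pos[best]]))
                best = i;
        }
        if (best < 0)
            break;
        if (runs[best].run[pos[best]] < last)
            abort();            /* merge output must be non-decreasing */
        last = runs[best].run[pos[best]++];
        nmerged++;
    }

    printf("merged %ld sorted items\n", nmerged);
    for (int i = 0; i < NWORKERS; i++)
        free(runs[i].run);
    return 0;
}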

I tend to think that this really simple approach would get much of the
gain of something more complicated -- no need to write shared memory
management code, minimal need to handle coordination between workers,
and no real changes to the algorithms used for each sub-problem. This
makes merging more of a bottleneck again, but that is a bottleneck on
I/O and especially memory bandwidth. Parallelism cannot help much with
that anyway (except by compressing runs with offset coding, perhaps,
but that isn't specific to parallelism and won't always help). Writing
out runs in bulk is very fast here -- certainly much faster than I
thought it would be when I started thinking about external sorting.
And if that turns out to be a problem for cases that have sufficient
memory to do everything internally, that can later be worked on
non-invasively.

As I've said in the past, I think parallel sorting only makes sense
when memory latency and bandwidth are not huge bottlenecks, which we
should bend over backwards to avoid. In a sense, you can't really make
use of parallel workers for sorting until you fix that problem first.

I am not suggesting that we do this because it's easier than other
approaches. I think it's actually most effective to not make parallel
sorting too divergent from serial sorting, because keeping the two
code paths shared makes speed-ups from localized optimizations cumulative,
while at the same time, AFAICT there isn't anything to recommend
extensive specialization for parallel sort. If what I've sketched is
also a significantly easier approach, then that's a bonus.

--
Peter Geoghegan

Attachment Content-Type Size
quicksort_external_test.txt text/plain 6.3 KB
0005-Use-tuple-proper-memory-pool-in-tuplesort.patch text/x-patch 24.7 KB
0004-Prefetch-from-memtuples-array-in-tuplesort.patch text/x-patch 10.4 KB
0003-Log-requirement-for-multiple-external-sort-passes.patch text/x-patch 8.3 KB
0002-Further-diminish-role-of-replacement-selection.patch text/x-patch 26.4 KB
0001-Quicksort-when-performing-external-sorts.patch text/x-patch 34.8 KB
