Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Greg S <stark(at)mit(dot)edu>
Subject: Re: Using quicksort for every external sort run
Date: 2015-12-28 23:03:25
Message-ID: CAM3SWZS4Mr-k6nUFBPcYkCoSQsjHXVJDqDa=PN_56Fiy1dsFgA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 18, 2015 at 11:57 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> BTW, I'm not necessarily determined to make the new special-purpose
> allocator work exactly as proposed. It seemed useful to prioritize
> simplicity, and so currently there is one big "huge palloc()" with
> which we blow our memory budget, and that's it. However, I could
> probably be more clever about "freeing ranges" initially reserved for
> a now-exhausted tape. That kind of thing.

Attached is a revision that significantly overhauls the memory patch,
with several smaller changes.

We can now grow memtuples to rebalance the size of the array
(memtupsize) against the memory needed for tuples. Doing this makes
a big difference with a 500MB work_mem setting in this datum sort
case, as my newly expanded trace_sort instrumentation shows:

LOG: grew memtuples 1.40x from 9362286 (219429 KB) to 13107200
(307200 KB) for final merge
LOG: tape 0 initially used 34110 KB of 34110 KB batch (1.000) and has
13107200 slots remaining
LOG: tape 1 initially used 34110 KB of 34110 KB batch (1.000) and has
1534 slots remaining
LOG: tape 2 initially used 34110 KB of 34110 KB batch (1.000) and has
1535 slots remaining
LOG: tape 3 initially used 34110 KB of 34110 KB batch (1.000) and has
1533 slots remaining
LOG: tape 4 initially used 34110 KB of 34110 KB batch (1.000) and has
1534 slots remaining
LOG: tape 5 initially used 34110 KB of 34110 KB batch (1.000) and has
1535 slots remaining

This is a big improvement. With the new batchmemtuples() call
commented out (i.e. no new grow_memtuples() call), the LOG output
around the same point is:

LOG: tape 0 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG: tape 1 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG: tape 2 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG: tape 3 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG: tape 4 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG: tape 5 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining

(I actually added a bit more detail to what you see here during final clean-up)

Obviously we're using memory a lot more efficiently here as compared
to my last revision (or the master branch -- it always has palloc()
overhead, of course). Without the new grow_memtuples call we'd no
longer be wasting ~1530 slots per tape (a tiny fraction of 1% of the
total anyway), but we'd instead be wasting 50% of all batch memory, or
almost 30% of all work_mem.
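
(For anyone who wants to double-check that waste figure, here's a quick
throwaway calculation in C -- plain arithmetic taken from the trace_sort
figures above with batchmemtuples() commented out, not patch code; the
500MB work_mem value is the test case's setting:)

#include <stdio.h>

/*
 * Sanity-check of the waste figures above (500MB work_mem, 6 tapes).
 * Numbers come straight from the trace_sort output shown earlier;
 * this is not patch code.
 */
int
main(void)
{
    const double work_mem_kb = 500.0 * 1024;    /* 500MB work_mem */
    const int    ntapes = 6;
    const double batch_kb = 48738;      /* per-tape batch size */
    const double used_kb = 24381;       /* per-tape batch actually used */
    double wasted_kb = ntapes * (batch_kb - used_kb);

    printf("wasted: %.0f KB (%.0f%% of batch memory, %.0f%% of work_mem)\n",
           wasted_kb,
           100.0 * (batch_kb - used_kb) / batch_kb,
           100.0 * wasted_kb / work_mem_kb);
    /* prints: wasted: 146142 KB (50% of batch memory, 29% of work_mem) */
    return 0;
}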

Note that this improvement is possible despite the fact that memory is
still MAXALIGN()'d -- I'm mostly just clawing back what I can, having
avoided much STANDARDCHUNKHEADERSIZE overhead for the final on-the-fly
merge. I tend to think that the bigger problem here is that we use so
many memtuples when merging in the first place (e.g. ~60% of work_mem
in the above case), because memtuples are much less useful than
something like a simple array of pointers when merging; I can
certainly see why you'd need 6 memtuples here, for the merge heap, but
the other ~13 million seem mostly unnecessary. Anyway, what I have now is as far as
I want to go to accelerate merging for 9.6, since parallel CREATE
INDEX is where the next big win will come from. As wasteful as this
can be, I think it's of secondary importance.
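
(To make the "6 vs. ~13 million" point a bit more concrete, here's a
minimal sketch of what a merge actually has to keep in memory per run --
illustrative only, with made-up names; the real SortTuple and merge
state look different:)

/*
 * A k-way merge only needs the current front tuple of each input run:
 * one heap entry per run/tape, plus the buffer space those tuples
 * themselves occupy.  Slots beyond that are just overhead.
 */
typedef struct RunHead
{
    void   *tuple;      /* current front tuple of this run */
    int     srctape;    /* which tape/run it was read from */
} RunHead;

typedef struct MergeHeap
{
    RunHead    *entries;    /* exactly nruns entries, e.g. 6 above */
    int         nruns;      /* number of input runs being merged */
} MergeHeap;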

With this revision, I've given up on the idea of trying to map
USEMEM()/FREEMEM() to "logical" allocations and deallocations that
consume from each tape's batch. The existing merge code in the master
branch is concerned exclusively with making each tape's use of memory
fair; each tape only gets so many "slots" (memtuples), and so much
memory, and that's it (there is never any shuffling of those resource
budgets between tapes). I get the same outcome by simply only
allowing tapes to get memory from their own batch allocation, which
adds little complexity, because only READTUP() routines regularly need
memory. We detect when memory has been exhausted within
mergeprereadone() in a special way, not using LACKMEM() at all -- this
seems simpler. (Specifically, we use something called overflow
allocations for this purpose. This means that there are still a very
limited number of retail palloc() calls.)
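
(To give a rough idea of the shape of that scheme -- this is a
simplified sketch, not the patch's actual code; the names are made up
and malloc() stands in for the retail palloc() used for overflow
allocations:)

#include <stdlib.h>

#define MAXALIGN(len)  (((len) + 7) & ~(size_t) 7)  /* illustrative 8-byte alignment */

/* One tape's batch: a contiguous range, consumed front to back */
typedef struct TapeBatch
{
    char   *mem;        /* start of this tape's preallocated range */
    size_t  used;       /* bytes handed out so far */
    size_t  size;       /* total size of the range */
    int     noverflow;  /* allocations that did not fit in the range */
} TapeBatch;

/*
 * Allocate space for a tuple read from this tape.  It comes out of the
 * tape's own batch range when it fits; otherwise we fall back to a
 * retail "overflow" allocation, and that occurrence is what tells the
 * caller the batch is exhausted (no LACKMEM() involved).
 */
void *
tape_batch_alloc(TapeBatch *tb, size_t len)
{
    len = MAXALIGN(len);
    if (tb->used + len <= tb->size)
    {
        void   *ptr = tb->mem + tb->used;

        tb->used += len;
        return ptr;
    }
    tb->noverflow++;
    return malloc(len);     /* stands in for a palloc() overflow allocation */
}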

This new version somewhat formalizes the idea that batch allocation
may one day have uses beyond the final on-the-fly merge phase, which
makes a lot of sense. We should really be saving a significant amount
of memory when initially sorting runs, too. This revision also
pfree()s tape memory early if the tape is exhausted early, which will
help a lot when there is a logical/physical correlation.

Overall, I'm far happier with how memory is managed in this revision,
mostly because it's easier to reason about. trace_sort now closely
monitors where memory goes, and I think that's a good idea in general.
That makes production performance problems a lot easier to reason
about -- the accounting should be available to expert users (who
enable trace_sort). I'll have little sympathy for the suggestion that
this will overwhelm users, because trace_sort is already only suitable
for experts. Besides, it isn't that complicated to figure this stuff
out, or at least gain an intuition for what might be going on based on
differences seen in a problematic case. Getting a better picture of
what "bad" looks like can guide an investigation without the DBA
necessarily understanding the underlying algorithms. At worst, it
gives them something specific to complain about here.

Other changes:

* No longer use "tuple proper" terminology. Also, memory pools are now
referred to as batch memory allocations. This is at the request of
Jeff and Robert.

* Fixed a silly bug in the useselection() cost model that caused "quicksort
with spillover" to never be used. The cost model is otherwise
unchanged, because I didn't come up with any bright ideas about how to
do better there. Ideas from other people are very much welcome.

* Cap the maximum number of tapes at 500. I think it's silly that the
number of tapes is currently a function of work_mem, without any
further consideration of the details of the sort, but capping is a
simpler solution than making tuplesort_merge_order() smarter (a rough
sketch of such a cap appears below). I previously saw quite a lot of
waste with high work_mem settings, with tens of thousands of tapes that
would never be used, precisely because we have lots of memory (the
justification for having, say, 40k tapes seems to be almost an
oxymoron). Tapes (or the accounting for never-allocated tapes) could
take almost 10% of all memory. Also, less importantly, we now
refund/FREEMEM() unallocated tape memory ahead of final on-the-fly
merge preallocation of batch memory.

Note that we've contemplated bounding the number of tapes several
times in the past. See the commit message of c65ab0bfa9, a commit from
almost a decade ago, for an example of this. That message also
describes how "slots" (memtuples) and memory for tuples must be kept
in balance while merging, which is very much relevant to my new
grow_memtuples() call.
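
(Here's a rough, self-contained sketch of the kind of cap I have in
mind -- the per-tape cost figure, MAXORDER, and the function name below
are placeholders, and only the clamp itself is the point; this is not
the actual tuplesort_merge_order() change:)

#include <stdio.h>
#include <stdint.h>

#define MINORDER    6               /* minimum merge order */
#define MAXORDER    500             /* proposed cap on merge order/tapes */
#define TAPE_COST   (256 * 1024)    /* illustrative per-tape memory cost */

/*
 * Derive a merge order from the memory budget, but never ask for more
 * than MAXORDER tapes: with very large work_mem the uncapped
 * calculation can request tens of thousands of tapes that will never
 * be used.
 */
static int
merge_order(int64_t allowed_mem)
{
    int64_t     morder = allowed_mem / TAPE_COST;

    if (morder < MINORDER)
        morder = MINORDER;
    if (morder > MAXORDER)
        morder = MAXORDER;
    return (int) morder;
}

int
main(void)
{
    printf("order at 64MB work_mem: %d\n",
           merge_order((int64_t) 64 * 1024 * 1024));
    printf("order at 8GB work_mem:  %d\n",
           merge_order((int64_t) 8 * 1024 * 1024 * 1024));
    return 0;
}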

--
Peter Geoghegan

Attachment Content-Type Size
0003-Perform-memory-prefetching-when-writing-memtuples.patch text/x-patch 7.8 KB
0002-Allocate-memory-in-batch-for-tuplesort-merge.patch text/x-patch 31.1 KB
0001-Quicksort-when-performing-external-sorts.patch text/x-patch 61.4 KB
