Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2016-03-17 01:42:34
Message-ID: CAM3SWZQU5mrWof+O6zR4qPjVtcNUVMs8HPJjpO5NP2BDdNHjvQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 16, 2016 at 3:31 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I spent some time looking at this part of the patch yesterday and
> today.

Thanks for getting back to it.

> - I think that batchmemtuples() is somewhat weird. Normally,
> grow_memtuples() doubles the size of the array each time it's called.
> So if you somehow called this function when you still had lots of
> memory available, it would just double the size of the array.
> However, I think the expectation is that it's only going to be called
> when availMem is less than half of allowedMem, in which case we're
> going to get the special "last increment of memtupsize" behavior,
> where we expand the memtuples array by some multiple between 1.0 and
> 2.0 based on allowedMem/memNowUsed.

That's right. It might be possible for the simple doubling behavior to
happen under artificial conditions instead, for example when we have
enormous individual tuples, but if that does happen it's still
correct. I just didn't think it was worth worrying about giving back
more memory in such extreme edge cases.
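
To illustrate the clamp (a rough sketch only; variable names are
illustrative, not lifted from the patch):

/*
 * Sketch of the "last increment" of memtupsize: scale the array by
 * allowedMem / memNowUsed rather than doubling, so that growth lands
 * somewhere between 1.0x and 2.0x.
 */
double  grow_ratio = (double) state->allowedMem / (double) memNowUsed;

if (grow_ratio >= 2.0)
    newmemtupsize = state->memtupsize * 2;      /* ordinary doubling */
else
    newmemtupsize = (int) (state->memtupsize * grow_ratio);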

> And after staring at this for a
> long time ... yeah, I think this does the right thing. But it
> certainly is hairy.

No arguments from me here. I think this is justified, though.

It's great that palloc() provides a simple, robust abstraction.
However, there are a small number of modules in the code, tuplesort.c
among them, where we need to be very careful about memory management;
probably no more than five such modules, and no fewer than three. In
these places, large memory allocations are the norm, and we ought to
pay close attention to memory locality, to heap fragmentation, and to
keeping memory well balanced among competing considerations. It's
entirely appropriate that we'd go to significant lengths to get this
right in these places using somewhat ad-hoc techniques, simply because
these are the places where we'll get a commensurate benefit. Some
people might call this adding custom memory allocators, but I find
that a loaded term, because it suggests intimate involvement from
mcxt.c.

> - It's not exactly clear what you mean when you say that the tuplesort
> context contains "everything else". I don't understand why that only
> ends up containing half the memory ... what, other than the memtuples
> array, ends up there?

I meant that the existing memory context, "sortcontext", contains
everything else that has anything to do with sorting. Everything that
it contains in the master branch it continues to contain with the
patch, with the sole exception of the vast majority of the caller's
tuples. So, "sortcontext" continues to include everything you can
think of:

* As you pointed out, the memtuples array.

* SortSupport state (assuming idiomatic usage of the API, at least).

* State specific to the cluster case.

* Transient state specific to the index case (i.e. scankey memory).

* logtape.c stuff.

* Dynamically allocated stuff for managing tapes (see inittapes()).

* For the sake of simplicity, a tiny number of remaining tuples (from
"overflow" allocations, or from when we need to free a tape's entire
batch because it is one tuple away from exhaustion).

These are tuples that the tuplesort caller needs to pfree() anyway,
per the tuplesort_get*tuple() API. It's just easier to put these
allocations in the "main" context, to avoid having to reason about any
consequences of calling MemoryContextReset() against our new tuple
context. This precaution is just future-proofing IMV.

I believe that this list is exhaustive.
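
To make the division of labor concrete, here is a rough sketch of the
context setup (consistent with my description; treat the details as
illustrative rather than the patch verbatim):

/*
 * "sortcontext" keeps everything in the list above. Caller tuples get
 * a child context of their own, which can be reset independently.
 */
state->sortcontext = AllocSetContextCreate(CurrentMemoryContext,
                                           "TupleSort",
                                           ALLOCSET_DEFAULT_MINSIZE,
                                           ALLOCSET_DEFAULT_INITSIZE,
                                           ALLOCSET_DEFAULT_MAXSIZE);
state->tuplecontext = AllocSetContextCreate(state->sortcontext,
                                            "Caller tuples",
                                            ALLOCSET_DEFAULT_MINSIZE,
                                            ALLOCSET_DEFAULT_INITSIZE,
                                            ALLOCSET_DEFAULT_MAXSIZE);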

> - If I understand correctly, the point of the MemoryContextReset call
> is: there wouldn't be any tuples in memory at that point anyway. But
> the OS-allocated chunks might be divided up into a bunch of small
> chunks that then got stuck on freelists, and those chunks might not be
> the right size for the next merge pass. Resetting the context avoids
> that problem by blowing up the freelists. Right?

Your summary of the practical benefit is accurate. While I've
emphasized regressions at the low end with this latest revision, it's
also true that resetting helps in memory-rich environments, when we
switch from retail palloc() calls to the final merge step's batch
allocation, a transition that palloc() seemed to handle very badly. It
makes sense that this abrupt change in the pattern of allocations
could cause significant heap memory fragmentation.

> Clever.

Thanks.

Introducing a separate memory context that is strictly used for caller
tuples makes it obvious that it's okay to call MemoryContextReset()
when state->memtupcount == 0. It's not okay to put anything in the new
context that could break those calls to MemoryContextReset().

You might not have noticed that a second MemoryContextReset() call
appears in the quicksort patch, which helps a bit too. I couldn't
easily make that work with the replacement selection heap, because
master's tuplesort.c never fully empties its RS heap until the last
run. I can only perform the first call to MemoryContextReset() in the
memory patch because it happens at a point where memtupcount == 0 --
it's called when a run is merged (outside of a final on-the-fly
merge). Notice that mergeonerun()'s loop condition is:

while (state->memtupcount > 0)
{
...
}

So, it must be that state->memtupcount == 0 (and that we have no batch
memory) when I call MemoryContextReset() immediately afterwards.
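
Putting that together, the tail end of mergeonerun() looks roughly
like this (heavily simplified sketch, not the patch verbatim):

while (state->memtupcount > 0)
{
    /*
     * Write the lowest tuple in the heap to the destination tape, and
     * refill the heap from the source run that tuple came from.
     */
}
markrunend(state, destTape);             /* as in master */
MemoryContextReset(state->tuplecontext); /* safe: memtupcount == 0 */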

> - I haven't yet figured out why we use batching only for the final
> on-the-fly merge pass, instead of doing it for all merges. I expect
> you have a reason. I just don't know what it is.

The most obvious reason, and possibly the only reason, is that I have
license to lock down memory accounting in the final on-the-fly merge
phase. Almost equi-sized runs are the norm, and code like this is no
longer obligated to work:

FREEMEM(state, GetMemoryChunkSpace(stup->tuple));

That's why I explicitly give up on "conventional accounting": USEMEM()
and FREEMEM() calls become unnecessary in this well-locked-down case.
Oh, and since I know that I won't use most tapes, I can give myself a
FREEMEM() refund before doing the new grow_memtuples() thing.
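
Roughly sketched, what that buys is this (the helper name is
hypothetical, and the details are illustrative, not the patch
verbatim): once the final merge's input tapes are fixed, remaining
memory can be carved up per tape exactly once, with no retail
USEMEM()/FREEMEM() per tuple thereafter.

/* Divide the remaining memory evenly among the active input tapes. */
int64   spacePerTape = state->availMem / state->activeTapes;
int     srcTape;

for (srcTape = 0; srcTape < state->activeTapes; srcTape++)
    batchalloc_for_tape(state, srcTape, spacePerTape);  /* hypothetical */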

I'd like to make batch memory usable for runs, too; I haven't done
that for similar reasons. FWIW, I see no great reason to worry about
non-final merges.

> - I have also not yet figured out why you chose to replace
> state->datumTypByVal with state->tuples and reverse the sense. I bet
> there's a reason for this, too. I don't know what it is, either.

It's slightly easier to make this a generic property of any tuplesort
("Can SortTuple.tuple ever be set?") than to leave it a property
specific to datum tuplesorts. Besides, state->datumTypByVal often
isn't initialized in master, and so cannot be checked as things stand
(unless the code is in a datum-case-specific routine).

This new flag controls batch memory in a slightly higher-level way
than would otherwise be possible. It also controls the memory
prefetching added by patch/commit 0003-*, FWIW.
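
For concreteness (the datum-case test reflects what I describe above;
the rest is an illustrative sketch):

/*
 * In tuplesort_begin_datum(): pass-by-value datums never carry
 * separately palloc()'d memory, so SortTuple.tuple is never set.
 */
state->tuples = !typbyval;

/* Any code path can now ask the generic question: */
if (state->tuples)
{
    /* batch memory (and 0003-*'s prefetching) can apply */
}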

--
Peter Geoghegan
