Re: PoC: Partial sort

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: PoC: Partial sort
Date: 2016-03-24 00:39:30
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers


On Tue, Mar 1, 2016 at 7:06 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> I finally went over your review.

I'll respond to your points here. Note that I'm reviewing
"partial-sort-basic-7.patch", which you sent on March 13. I respond
here because this is where you answered my questions (I had no
feedback on "partial-sort-basic-6.patch", which didn't use the new
upper planner pathification stuff, unlike this latest version).

> On Wed, Nov 4, 2015 at 4:44 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> Explain output
>> -------------------

>> I think it might be a good idea to also have a "Sort Groups: 2" field
>> above. That illustrates that you are in fact performing 2 small sorts
>> per group, which is the reality. As you said, it's good to have this
>> be high, because then the sort operations don't need to do too many
>> comparisons, which could be expensive.
> I agree with your notes. In the attached version of path explain output was
> revised as you proposed.


>> Sort Method
>> ----------------
>> Even thought the explain analyze above shows "top-N heapsort" as its
>> sort method, that isn't really true. I actually ran this through a
>> debugger, which is why the above plan took so long to execute, in case
>> you wondered. I saw that in practice the first sort executed for the
>> first group uses a quicksort, while only the second sort (needed for
>> the 2 and last group in this example) used a top-N heapsort.

> With partial sort we run multiple sorts in the same node. Ideally, we need
> to provide some aggregated information over runs.
> This situation looks very similar to subplan which is called multiple times.
> I checked how it works for now.

Noticed this in nodeSort.c:

+ if (node->tuplesortstate != NULL)
+ {
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ node->groupsCount++;
+ }
+ else
+ {
+ /* Support structures for cmpSortSkipCols - already
sorted columns */
+ if (skipCols)
+ prepareSkipCols(plannode, node);

+ /*
+ * Only pass on remaining columns that are unsorted.
Skip abbreviated
+ * keys usage for partial sort. We unlikely will have
huge groups
+ * with partial sort. Therefore usage of abbreviated
keys would be
+ * likely a waste of time.
+ */
tuplesortstate = tuplesort_begin_heap(tupDesc,

You should comment on which case is which, and put common case (no
skip cols) first. Similarly, the ExecSort() for(;;) should put the
common (non-partial) case first, which it does, but then the "first
tuple in partial sort" case first, then the "second or subsequent
partial sort" case last.

More comments here, please:

+typedef struct SkipKeyData
+ FunctionCallInfoData fcinfo;
+ FmgrInfo flinfo;
+ OffsetNumber attno;
+} SkipKeyData;

(What's SkipKeyData?)

Also want comments for new SortState fields. SortState.prev is a
palloc()'d copy of tuple, which should be directly noted, as it is for
similar aggregate cases, etc.

Should you be more aggressive about freeing memory allocated for
SortState.prev tuples?

The new function cmpSortSkipCols() should say "Special case for
NULL-vs-NULL, else use standard comparison", or something. "Lets
pretend NULL is a value for implementation convenience" cases are
considered the exception, and are always noted as the exception.

> In the case of subplan explain analyze gives us just information about last
> subplan run. This makes me uneasy. From one side, it's probably OK that
> partial sort behaves like subplan while showing information just about last
> sort run. From the other side, we need some better solution for that in
> general case.

I see what you mean, but I wasn't so much complaining about that, as
complaining about the simple fact that we use a top-N heap sort *at
all*. This feels like the "limit" case is playing with partial sort
sub-sorts in a way that it shouldn't.

I see you have code like this to make this work:

+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound,
node->bound_Done + nTuples);

But, why bother? Why not simply prevent tuplesort.c from ever using
the top-N heapsort method when it is called from nodeSort.c for a
partial sort (probably in the planner)?

Why, at a high level, does it make sense to pass down a limit to *any*
sort operation that makes up a partial sort, even the last? This seems
to be adding complexity without a benefit. A big advantage of top-N
heapsorts is that much less memory could be used, but this patch
already has the memory allocated that belonged to previous performsort
calls (mostly -- certainly has all those tuplesort.c memtuples
throughout, a major user of memory overall). It's not going to be
very good at preventing work, except almost by accident because we
happen to have a limit up to just past the beginning of a smaller
partial sort group. I'd rather use quicksort, which is very versatile.
Top-N sorts make sense when sorting itself is the bottleneck, which it
probably won't be for a partial sort (that's the whole point).

If the sort method was very likely the same for every performsort
(quicksort), which it otherwise would be, then I'd care way way less
that that could be a bit misleading in EXPLAIN ANALYZE output, because
typically the last one would be "close enough". Although, this isn't
quite like your SubPlan example, because the Sort node isn't reported
as e.g. "SubPlan 1" by EXPLAIN.

I think that this has bugs for external sorts:

+tuplesort_reset(Tuplesortstate *state)
+ int i;
+ if (state->tapeset)
+ LogicalTapeSetClose(state->tapeset);
+ for (i = 0; i < state->memtupcount; i++)
+ free_sort_tuple(state, state->memtuples + i);
+ state->status = TSS_INITIAL;
+ state->memtupcount = 0;
+ state->boundUsed = false;
+ state->tapeset = NULL;
+ state->currentRun = 0;
+ state->result_tape = -1;
+ state->bounded = false;

It's not okay to reset like this, especially not after the recent
commit 0011c0091, which could make this code unacceptably leak memory.
I realize that we really should never use an external sort here, but,
as you know, this is not the point.

So, I want to suggest that you use the regular code to destroy and
recreate a tuplesort in this case. Now, obviously that has some
significant disadvantages -- you want to reuse everything in the
common case when each sort is tiny. But you can still do that for that
very common case.

I think you need to use sortcontext memory context here on general
principle, even if current usage isn't broken by that.

Even if you get this right for external sorts once, it will break
again without anyone noticing until it's too late. Better to not rely
on it staying in sync, and find a way of having the standard
tuplesort.c initialization begin again.

Even though these free_sort_tuple() calls are still needed, you might
also call "MemoryContextReset(state->tuplecontext)" at the end. That
might prevent palloc() fragmentation when groups are of wildly
different sizes. Just an idea.

>> I don't like that you've added a Plan node argument to
>> ExecMaterializesOutput() in this function, too.
> I don't like this too. But I didn't find better solution without significant
> rework of planner.
> However, "Upper planner pathification" by Tom Lane seems to have such
> rework. It's likely sort becomes separate path node there.
> Then ExecMaterializesOutput could read parameters of path node.

A tuplesort may be randomAccess, or !randomAccess, as the caller
wishes. It's good for performance if the caller does not want
randomAccess, because then we can do our final merge on-the-fly if
it's an external sort, which helps a lot.

How is this different? ExecMaterializesOutput() seems to be about
whether or not the plan *could* materialize its output in principle,
even though you might well want to make it not do so in specific
cases. So, it's not so much that the new argument is ugly; rather, I
worry that it's wrong to make ExecMaterializesOutput() give a more
specific answer than anticipated by current callers.

Is the difference basically just that a partial sort could be
enormously faster, whereas a !randomAccess conventional sort is nice
to have, but not worth e.g. changing cost_sort() to account for?

You might just make a new function, ExecPlanMaterializesOutput(),
instead. That would call ExecMaterializesOutput() for non-Sort cases.

>> Optimizer
>> -------------

>> * cost_sort() needs way way more comments. Doesn't even mention
>> indexes. Not worth commenting further on until I know what it's
>> *supposed* to do, though.
> I've added some comments.

Looking at cost_sort() now, it's a bit clearer. I think that you
should make sure that everything is costed as a quicksort, though, if
you accept that we should try and make every small sort done by the
partial sort a quicksort. Which I think is a good idea. The common
case is that groups are small, but the qsort() insertion sort will be
very very fast for that case.

>> * New loop within get_cheapest_fractional_path_for_pathkeys() requires
>> far more explanation.
>> Explain theory behind derivation of compare_bifractional_path_costs()
>> fraction arguments, please. I think there might be simple heuristics
>> like this elsewhere in the optimizer or selfuncs.c, but you need to
>> share why you did things that way in the code.
> Idea is that since partial sort fetches data per group then it would require
> fetching more data than fully presorted path.

I think I get it.

>> * Within planner.c, "partial_sort_path" variable name in
>> grouping_planner() might be called something else.
>> Its purpose isn't clear. Also, the way that you mix path costs from
>> the new get_cheapest_fractional_path_for_pathkeys() into the new
>> cost_sort() needs to be explained in detail (as I already said,
>> cost_sort() is currently very under-documented).
> I've tried to make it more clear. partial_sort_path is renamed to
> presorted_path.

> Unique paths occasionally can use this optimization.

> But it depends on attribute order. I could work out this case, but I would
> prefer some simple case to commit before. I already throw merge join
> optimization away for the sake of simplicity.

I think that was the right decision under our time constraints.
However, I suggest noting that this should happen for unique paths in
the future, say within create_unique_path().

Other notes:

This looks like an old change you missed:

- * compare_path_fractional_costs
+ * compare_fractional_path_costs

All in all, this looks significantly better. Thanks for your work on
this. Sorry for the delay in my response, and that my review was
relatively rushed, but I'm rather busy at the moment with fighting

Peter Geoghegan

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-03-24 00:40:08 Re: Postgres_fdw join pushdown - getting server crash in left outer join of three table
Previous Message Michael Paquier 2016-03-24 00:38:22 Re: Bug in searching path in jsonb_set when walking through JSONB array