Re: PoC: Partial sort

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: PoC: Partial sort
Date: 2016-03-30 15:02:06
Message-ID: CAPpHfdtCcHZ-mLWzsFrRCvHpV1LPSaOGooMZ3sa40AkwR=7ouQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, Peter!

Thank you for review!

On Thu, Mar 24, 2016 at 3:39 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> >> Sort Method
> >> ----------------
> >>
> >> Even thought the explain analyze above shows "top-N heapsort" as its
> >> sort method, that isn't really true. I actually ran this through a
> >> debugger, which is why the above plan took so long to execute, in case
> >> you wondered. I saw that in practice the first sort executed for the
> >> first group uses a quicksort, while only the second sort (needed for
> >> the 2 and last group in this example) used a top-N heapsort.
>
> > With partial sort we run multiple sorts in the same node. Ideally, we
> need
> > to provide some aggregated information over runs.
> > This situation looks very similar to subplan which is called multiple
> times.
> > I checked how it works for now.
>
> Noticed this in nodeSort.c:
>
> + if (node->tuplesortstate != NULL)
> + {
> + tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
> + node->groupsCount++;
> + }
> + else
> + {
> + /* Support structures for cmpSortSkipCols - already
> sorted columns */
> + if (skipCols)
> + prepareSkipCols(plannode, node);
>
> + /*
> + * Only pass on remaining columns that are unsorted.
> Skip abbreviated
> + * keys usage for partial sort. We unlikely will have
> huge groups
> + * with partial sort. Therefore usage of abbreviated
> keys would be
> + * likely a waste of time.
> + */
> tuplesortstate = tuplesort_begin_heap(tupDesc,
>
> You should comment on which case is which, and put common case (no
> skip cols) first. Similarly, the ExecSort() for(;;) should put the
> common (non-partial) case first, which it does, but then the "first
> tuple in partial sort" case first, then the "second or subsequent
> partial sort" case last.
>

Done.

More comments here, please:
>
> +typedef struct SkipKeyData
> +{
> + FunctionCallInfoData fcinfo;
> + FmgrInfo flinfo;
> + OffsetNumber attno;
> +} SkipKeyData;
>
> (What's SkipKeyData?)
>
> Also want comments for new SortState fields.

Done.

> SortState.prev is a
> palloc()'d copy of tuple, which should be directly noted, as it is for
> similar aggregate cases, etc.
>
> Should you be more aggressive about freeing memory allocated for
> SortState.prev tuples?
>

Fixed.

> The new function cmpSortSkipCols() should say "Special case for
> NULL-vs-NULL, else use standard comparison", or something. "Lets
> pretend NULL is a value for implementation convenience" cases are
> considered the exception, and are always noted as the exception.
>

Comment is added.

> > In the case of subplan explain analyze gives us just information about
> last
> > subplan run. This makes me uneasy. From one side, it's probably OK that
> > partial sort behaves like subplan while showing information just about
> last
> > sort run. From the other side, we need some better solution for that in
> > general case.
>
> I see what you mean, but I wasn't so much complaining about that, as
> complaining about the simple fact that we use a top-N heap sort *at
> all*. This feels like the "limit" case is playing with partial sort
> sub-sorts in a way that it shouldn't.
>
> I see you have code like this to make this work:
>
> + /*
> + * Adjust bound_Done with number of tuples we've actually sorted.
> + */
> + if (node->bounded)
> + {
> + if (node->finished)
> + node->bound_Done = node->bound;
> + else
> + node->bound_Done = Min(node->bound,
> node->bound_Done + nTuples);
>
> But, why bother? Why not simply prevent tuplesort.c from ever using
> the top-N heapsort method when it is called from nodeSort.c for a
> partial sort (probably in the planner)?
>
> Why, at a high level, does it make sense to pass down a limit to *any*
> sort operation that makes up a partial sort, even the last? This seems
> to be adding complexity without a benefit. A big advantage of top-N
> heapsorts is that much less memory could be used, but this patch
> already has the memory allocated that belonged to previous performsort
> calls (mostly -- certainly has all those tuplesort.c memtuples
> throughout, a major user of memory overall). It's not going to be
> very good at preventing work, except almost by accident because we
> happen to have a limit up to just past the beginning of a smaller
> partial sort group. I'd rather use quicksort, which is very versatile.
> Top-N sorts make sense when sorting itself is the bottleneck, which it
> probably won't be for a partial sort (that's the whole point).
>

Hmm... I'm not completely agree with that. In typical usage partial sort
should definitely use quicksort. However, fallback to other sort methods
is very useful. Decision of partial sort usage is made by planner. But
planner makes mistakes. For example, our HashAggregate is purely
in-memory. In the case of planner mistake it causes OOM. I met such
situation in production and not once. This is why I'd like partial sort to
have graceful degradation for such cases.

If the sort method was very likely the same for every performsort
> (quicksort), which it otherwise would be, then I'd care way way less
> that that could be a bit misleading in EXPLAIN ANALYZE output, because
> typically the last one would be "close enough". Although, this isn't
> quite like your SubPlan example, because the Sort node isn't reported
> as e.g. "SubPlan 1" by EXPLAIN.
>

I've adjusted EXPLAIN ANALYZE to show maximum resources consumption.

>
> I think that this has bugs for external sorts:
>
> +void
> +tuplesort_reset(Tuplesortstate *state)
> +{
> + int i;
> +
> + if (state->tapeset)
> + LogicalTapeSetClose(state->tapeset);
> +
> + for (i = 0; i < state->memtupcount; i++)
> + free_sort_tuple(state, state->memtuples + i);
> +
> + state->status = TSS_INITIAL;
> + state->memtupcount = 0;
> + state->boundUsed = false;
> + state->tapeset = NULL;
> + state->currentRun = 0;
> + state->result_tape = -1;
> + state->bounded = false;
> +}
>
> It's not okay to reset like this, especially not after the recent
> commit 0011c0091, which could make this code unacceptably leak memory.
> I realize that we really should never use an external sort here, but,
> as you know, this is not the point.
>
> So, I want to suggest that you use the regular code to destroy and
> recreate a tuplesort in this case. Now, obviously that has some
> significant disadvantages -- you want to reuse everything in the
> common case when each sort is tiny. But you can still do that for that
> very common case.
>
> I think you need to use sortcontext memory context here on general
> principle, even if current usage isn't broken by that.
>
> Even if you get this right for external sorts once, it will break
> again without anyone noticing until it's too late. Better to not rely
> on it staying in sync, and find a way of having the standard
> tuplesort.c initialization begin again.
>
> Even though these free_sort_tuple() calls are still needed, you might
> also call "MemoryContextReset(state->tuplecontext)" at the end. That
> might prevent palloc() fragmentation when groups are of wildly
> different sizes. Just an idea.
>

I tried to manage this by introducing another memory context which is
persistent between partial sort batches. Other memory contexts are reset.

> >> I don't like that you've added a Plan node argument to
> >> ExecMaterializesOutput() in this function, too.
> >
> >
> > I don't like this too. But I didn't find better solution without
> significant
> > rework of planner.
> > However, "Upper planner pathification" by Tom Lane seems to have such
> > rework. It's likely sort becomes separate path node there.
> > Then ExecMaterializesOutput could read parameters of path node.
>
> A tuplesort may be randomAccess, or !randomAccess, as the caller
> wishes. It's good for performance if the caller does not want
> randomAccess, because then we can do our final merge on-the-fly if
> it's an external sort, which helps a lot.
>
> How is this different? ExecMaterializesOutput() seems to be about
> whether or not the plan *could* materialize its output in principle,
> even though you might well want to make it not do so in specific
> cases. So, it's not so much that the new argument is ugly; rather, I
> worry that it's wrong to make ExecMaterializesOutput() give a more
> specific answer than anticipated by current callers.
>
> Is the difference basically just that a partial sort could be
> enormously faster, whereas a !randomAccess conventional sort is nice
> to have, but not worth e.g. changing cost_sort() to account for?
>
> You might just make a new function, ExecPlanMaterializesOutput(),
> instead. That would call ExecMaterializesOutput() for non-Sort cases.
>

I've added ExecPlanMaterializesOutput() function.

> >> Optimizer
> >> -------------
> >>
>
> >> * cost_sort() needs way way more comments. Doesn't even mention
> >> indexes. Not worth commenting further on until I know what it's
> >> *supposed* to do, though.
> >
> >
> > I've added some comments.
>
> Looking at cost_sort() now, it's a bit clearer. I think that you
> should make sure that everything is costed as a quicksort, though, if
> you accept that we should try and make every small sort done by the
> partial sort a quicksort. Which I think is a good idea. The common
> case is that groups are small, but the qsort() insertion sort will be
> very very fast for that case.
>

I'm not sure that in partial sort we should estimate sorting of one group
in other way than simple sort does. I see following reasons:
1) I'd like partial sort to be able to use other sorting methods to provide
graceful degradation in the case of planner mistakes as I pointed above.
2) Planner should don't choose partial sort plans in some cases. That
should happen because higher cost of these plans including high cost of
partial sort. Estimation of other sort methods looks like good way for
reporting such high costs.

> This looks like an old change you missed:
>
> - * compare_path_fractional_costs
> + * compare_fractional_path_costs
>

I think this is rather a typo fix. Because now comment doesn't meet
function name.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
partial-sort-basic-8.patch application/octet-stream 85.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien 2016-03-30 15:29:34 Desirable pgbench features?
Previous Message Aleksander Alekseev 2016-03-30 14:59:20 Re: WIP: Access method extendability