Re: PoC: Partial sort

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: PoC: Partial sort
Date: 2016-03-01 15:06:37
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Hi, Peter!

I finally went over your review.

On Wed, Nov 4, 2015 at 4:44 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> Explain output
> -------------------
> A query like your original test query looks like this for me:
> postgres=# explain analyze select * from test order by v1, v2 limit 100;
> --------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=429.54..434.53 rows=100 width=16) (actual
> time=15125.819..22414.038 rows=100 loops=1)
> -> Partial sort (cost=429.54..50295.52 rows=1000000 width=16)
> (actual time=15125.799..22413.297 rows=100 loops=1)
> Sort Key: v1, v2
> Presorted Key: v1
> Sort Method: top-N heapsort Memory: 27kB
> -> Index Scan using test_v1_idx on test
> (cost=0.42..47604.43 rows=1000000 width=16) (actual time=1.663..13.066
> rows=151 loops=1)
> Planning time: 0.948 ms
> Execution time: 22414.895 ms
> (8 rows)
> I thought about it for a while, and decided that you have the basic
> shape of the explain output right here. I see where you are going by
> having the sort node drive things.
> I don't think the node should be called "Partial sort", though. I
> think that this is better presented as just a "Sort" node with a
> presorted key.
> I think it might be a good idea to also have a "Sort Groups: 2" field
> above. That illustrates that you are in fact performing 2 small sorts
> per group, which is the reality. As you said, it's good to have this
> be high, because then the sort operations don't need to do too many
> comparisons, which could be expensive.

I agree with your notes. In the attached version of path explain output was
revised as you proposed.

> Sort Method
> ----------------
> Even thought the explain analyze above shows "top-N heapsort" as its
> sort method, that isn't really true. I actually ran this through a
> debugger, which is why the above plan took so long to execute, in case
> you wondered. I saw that in practice the first sort executed for the
> first group uses a quicksort, while only the second sort (needed for
> the 2 and last group in this example) used a top-N heapsort.
> Is it really worth using a top-N heapsort to avoid sorting the last
> little bit of tuples in the last group? Maybe we should never push
> down to an individual sort operation (we have one
> tuplesort_performsort() per group) that it should be bounded. Our
> quicksort falls back on insertion sort in the event of only having 7
> elements (at that level of recursion), so having this almost always
> use quicksort may be no bad thing.
> Even if you don't like that, the "Sort Method" shown above is just
> misleading. I wonder, also, if you need to be more careful about
> whether or not "Memory" is really the high watermark, as opposed to
> the memory used by the last sort operation of many. There could be
> many large tuples in one grouping, for example. Note that the current
> code will not show any "Memory" in explain analyze for cases that have
> memory freed before execution is done, which this is not consistent
> with. Maybe that's not so important. Unsure.
> trace_sort output shows that these queries often use a large number of
> tiny individual sorts. Maybe that's okay, or maybe we should make it
> clearer that the sorts are related. I now use trace_sort a lot.

With partial sort we run multiple sorts in the same node. Ideally, we need
to provide some aggregated information over runs.
This situation looks very similar to subplan which is called multiple
times. I checked how it works for now.

# explain analyze select (select sum(x.i) from (select i from
generate_series(1,t * 1000) i order by i desc) x) from generate_series(1,
20, 1) t;
Function Scan on generate_series t (cost=0.00..74853.92 rows=1000
width=4) (actual time=0.777..83.498 rows=20 loops=1)
SubPlan 1
-> Aggregate (cost=74.83..74.84 rows=1 width=4) (actual
time=4.173..4.173 rows=1 loops=20)
-> Sort (cost=59.83..62.33 rows=1000 width=4) (actual
time=2.822..3.361 rows=10500 loops=20)
Sort Key: i.i
Sort Method: quicksort Memory: 1706kB
-> Function Scan on generate_series i (cost=0.01..10.01
rows=1000 width=4) (actual time=0.499..1.106 rows=10500 loops=20)
Planning time: 0.080 ms
Execution time: 83.625 ms
(9 rows)

# explain analyze select (select sum(x.i) from (select i from
generate_series(1,t * 1000) i order by i desc) x) from generate_series(20,
1, -1) t;
Function Scan on generate_series t (cost=0.00..74853.92 rows=1000
width=4) (actual time=11.414..86.127 rows=20 loops=1)
SubPlan 1
-> Aggregate (cost=74.83..74.84 rows=1 width=4) (actual
time=4.305..4.305 rows=1 loops=20)
-> Sort (cost=59.83..62.33 rows=1000 width=4) (actual
time=2.944..3.486 rows=10500 loops=20)
Sort Key: i.i
Sort Method: quicksort Memory: 71kB
-> Function Scan on generate_series i (cost=0.01..10.01
rows=1000 width=4) (actual time=0.527..1.125 rows=10500 loops=20)
Planning time: 0.080 ms
Execution time: 86.165 ms
(9 rows)

In the case of subplan explain analyze gives us just information about last
subplan run. This makes me uneasy. From one side, it's probably OK that
partial sort behaves like subplan while showing information just about last
sort run. From the other side, we need some better solution for that in
general case.

> Abbreviated Keys
> -----------------------
> It could be very bad for performance that the first non-presorted key
> uses abbreviated keys. There needs to be a way to tell tuplesort to
> not waste its time with them, just as there currently is for bounded
> (top-N heapsort) sorts. They're almost certainly the wrong way to go,
> unless you have huge groups (where partial sorting is unlikely to win
> in the first place).

Agree. I made

> Other issues in executor
> --------------------------------
> This is sort of an optimizer issue, but code lives in execAmi.c.
> Assert is redundant here:
> + case T_Sort:
> + /* We shouldn't reach here without having plan
> node */
> + Assert(node);
> + /* With skipCols sort node holds only last bucket
> */
> + if (node && ((Sort *)node)->skipCols == 0)
> + return true;
> + else
> + return false;
> I don't like that you've added a Plan node argument to
> ExecMaterializesOutput() in this function, too.

I don't like this too. But I didn't find better solution without
significant rework of planner.
However, "Upper planner pathification" by Tom Lane seems to have such
rework. It's likely sort becomes separate path node there.
Then ExecMaterializesOutput could read parameters of path node.

> There is similar assert/pointer test code within
> ExecSupportsBackwardScan() and ExecSupportsMarkRestore(). In general,
> I have concerns about the way the determination of a sort's ability to
> do stuff like be scanned backwards is now made dynamic, which this new
> code demonstrates:
> /*
> + * skipCols can't be used with either EXEC_FLAG_REWIND,
> + * or EXEC_FLAG_MARK, because we hold only current bucket in
> + * tuplesortstate.
> + */
> + Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
> +
> +
> EXEC_FLAG_MARK)) == 0);
> +
> I need to think some more about this general issue.

It has to be dynamic if we want to keep full sort and partial sort in the
same node. If properties of full sort and partial sort are different then
and they share same node then this properties of Sort node have to be
Alternative idea I have is that Sort node should fallback to full sort if
it sees any of above flags. But I'm not sure this is right. In some cases
it might be cheaper to partial sort then materialize than fallback to full

Misc. issues
> ----------------
> _readSort() needs READ_INT_FIELD(). _outSort() similarly needs
> WRITE_INT_FIELD(). You've mostly missed this stuff.
> Please be more careful about this. It's always a good idea to run the
> regression tests with "#define COPY_PARSE_PLAN_TREES" from time to
> time, which tends to highlight these problems.

Fixed. I've tried "#define COPY_PARSE_PLAN_TREES", now regression tests are
passed with it.

tuplesort.h should not include sortsupport.h. It's a modularity
> violation, and besides which is unnecessary. Similarly, pathkeys.c
> should not include optimizer/cost.h.


What is this?
> + if (inner_cheapest_total &&
> inner_cheapest_total->pathtype == T_Sort)
> + elog(ERROR, "Sort");

It's just piece of junk I used for debug. Deleted.

> Optimizer
> -------------
> I am not an expert on the optimizer, but I do have some feedback.
> * cost_sort() needs way way more comments. Doesn't even mention
> indexes. Not worth commenting further on until I know what it's
> *supposed* to do, though.

I've added some comments.

> * pathkeys_useful_for_ordering() now looks like a private convenience
> wrapper for the new public function pathkeys_common(). I think that
> comments should make this quite clear.

That's it. Explicit comment about that was added.

> * compare_bifractional_path_costs() should live beside
> compare_fractional_path_costs() and be public, I think. The existing
> compare_fractional_path_costs() also only has a small number of
> possible clients, and is still not static.

Now compare_bifractional_path_costs() is together with

> * Think it's not okay that there are new arguments, such as the
> "tuples" argument for get_cheapest_fractional_path_for_pathkeys().
> It seems a bad sign, design-wise, that a new argument of "PlannerInfo
> *root" was added at end, for the narrow purpose of passing stuff to
> estimate number of groups for the benefit of this patch. ISTM that
> grouping_planner() caller should do the
> work itself as and when it alone needs to.

Now grouping_planner() should call separate function
estimate_pathkeys_groups() which is responsible for estimating number of

> * New loop within get_cheapest_fractional_path_for_pathkeys() requires
> far more explanation.
> Explain theory behind derivation of compare_bifractional_path_costs()
> fraction arguments, please. I think there might be simple heuristics
> like this elsewhere in the optimizer or selfuncs.c, but you need to
> share why you did things that way in the code.

Idea is that since partial sort fetches data per group then it would
require fetching more data than fully presorted path.

* Within planner.c, "partial_sort_path" variable name in
> grouping_planner() might be called something else.
> Its purpose isn't clear. Also, the way that you mix path costs from
> the new get_cheapest_fractional_path_for_pathkeys() into the new
> cost_sort() needs to be explained in detail (as I already said,
> cost_sort() is currently very under-documented).

I've tried to make it more clear. partial_sort_path is renamed
to presorted_path.

> Obviously the optimizer part of this needs the most work -- no
> surprises there. I wonder if we cover all useful cases? I haven't yet
> got around to using "#define OPTIMIZER_DEBUG" to see what's really
> going on, which seems essential to understanding what is really
> happening, but it looks like merge append paths can currently use the
> optimization, whereas unique paths cannot. Have you thought about
> that?

Unique paths occasionally can use this optimization.

# create table test as (select id, random() as v from
generate_series(1,1000000) id);
# create index test_v_idx on test(v);

# explain select distinct v, id from test;
Unique (cost=0.47..55104.41 rows=1000000 width=12)
-> Sort (cost=0.47..50104.41 rows=1000000 width=12)
Sort Key: v, id
Presorted Key: v
-> Index Scan using test_v_idx on test (cost=0.42..47604.41
rows=1000000 width=12)
(5 rows)

# explain select distinct id, v from test;
Unique (cost=132154.34..139654.34 rows=1000000 width=12)
-> Sort (cost=132154.34..134654.34 rows=1000000 width=12)
Sort Key: id, v
-> Seq Scan on test (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

But it depends on attribute order. I could work out this case, but I would
prefer some simple case to commit before. I already throw merge join
optimization away for the sake of simplicity.

Alexander Korotkov
Postgres Professional:
The Russian Postgres Company

Attachment Content-Type Size
partial-sort-basic-6.patch application/octet-stream 88.5 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2016-03-01 15:06:47 Re: checkpointer continuous flushing - V16
Previous Message Tom Lane 2016-03-01 15:02:07 Re: WIP: Upper planner pathification