Re: PoC: Partial sort

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: PoC: Partial sort
Date: 2015-11-04 01:44:14
Message-ID: CAM3SWZRxA2+VZiUdYWrskZgczTnFOBUAcrkj2XX+Drw-E7Zhog@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 20, 2015 at 4:17 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> Planner regression is fixed in the attached version of patch. It appears
> that get_cheapest_fractional_path_for_pathkeys() behaved wrong when no
> ordering is required.

I took a look at this. My remarks are not comprehensive, but are just
what I noticed having looked at this for the first time in well over a
year.

Before anything else, I would like to emphasize that I think that this
is pretty important work; it's not just a "nice to have". I'm very
glad you picked it up, because we need it. In the real world, there
will be *lots* of cases that this helps.

Explain output
-------------------

A query like your original test query looks like this for me:

postgres=# explain analyze select * from test order by v1, v2 limit 100;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=429.54..434.53 rows=100 width=16) (actual time=15125.819..22414.038 rows=100 loops=1)
   ->  Partial sort  (cost=429.54..50295.52 rows=1000000 width=16) (actual time=15125.799..22413.297 rows=100 loops=1)
         Sort Key: v1, v2
         Presorted Key: v1
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.43 rows=1000000 width=16) (actual time=1.663..13.066 rows=151 loops=1)
 Planning time: 0.948 ms
 Execution time: 22414.895 ms
(8 rows)

I thought about it for a while, and decided that you have the basic
shape of the explain output right here. I see where you are going by
having the sort node drive things.

I don't think the node should be called "Partial sort", though. I
think that this is better presented as just a "Sort" node with a
presorted key.

I think it might be a good idea to also have a "Sort Groups: 2" field
above. That illustrates that you are in fact performing 2 small sorts,
one per group, which is the reality. As you said, it's good for this
number to be high, because then the individual sort operations don't
need to do too many comparisons, which could be expensive.
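
For example (just a mockup of the proposed output, based on the plan above):

   ->  Sort  (cost=429.54..50295.52 rows=1000000 width=16) (actual time=15125.799..22413.297 rows=100 loops=1)
         Sort Key: v1, v2
         Presorted Key: v1
         Sort Groups: 2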

Sort Method
----------------

Even though the explain analyze above shows "top-N heapsort" as its
sort method, that isn't really true. I actually ran this through a
debugger, which is why the above plan took so long to execute, in case
you wondered. I saw that in practice the first sort executed (for the
first group) uses a quicksort, while only the second sort (needed for
the second and last group in this example) used a top-N heapsort.

Is it really worth using a top-N heapsort to avoid sorting the last
little bit of tuples in the last group? Maybe we should never tell an
individual sort operation (there is one tuplesort_performsort() call
per group) that it should be bounded. Our quicksort falls back on
insertion sort when there are fewer than 7 elements (at that level of
recursion), so having this almost always use quicksort may be no bad
thing.
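
To illustrate the choice, here is a sketch of the per-group control
flow I have in mind (the helper name and the bookkeeping fields are
hypothetical, not the patch's actual code):

    /*
     * One tuplesort_performsort() per group of equal presorted-key values.
     * The question is whether any of these per-group sorts should ever be
     * told that it is bounded.
     */
    while (load_next_group(node, tuplesortstate))   /* hypothetical helper */
    {
        /*
         * The part being questioned: bounding the sort when a LIMIT
         * applies.  Never doing this means each group is simply
         * quicksorted, and tiny groups hit the insertion sort path anyway.
         */
        if (node->bounded)
            tuplesort_set_bound(tuplesortstate,
                                node->bound - node->ntuples_emitted);

        tuplesort_performsort(tuplesortstate);

        /* ... emit this group's tuples, stopping early if bounded ... */
    }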

Even if you don't like that, the "Sort Method" shown above is just
misleading. I wonder, also, if you need to be more careful about
whether or not "Memory" is really the high watermark, as opposed to
the memory used by the last sort operation of many. There could be
many large tuples in one grouping, for example. Note that the current
code will not show any "Memory" in explain analyze for cases where
memory is freed before execution is done, so this is not consistent
with that. Maybe that's not so important. Unsure.

trace_sort output shows that these queries often use a large number of
tiny individual sorts. Maybe that's okay, or maybe we should make it
clearer that the sorts are related. I now use trace_sort a lot.

Abbreviated Keys
-----------------------

It could be very bad for performance that the first non-presorted key
uses abbreviated keys. There needs to be a way to tell tuplesort to
not waste its time with them, just as there currently is for bounded
(top-N heapsort) sorts. They're almost certainly the wrong way to go,
unless you have huge groups (where partial sorting is unlikely to win
in the first place).
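
For what it's worth, tuplesort_set_bound() already abandons
abbreviation when a sort is marked as bounded, so a new entry point in
tuplesort.c could do the same thing up front. A sketch (the function
name is made up; the field manipulation mirrors what
tuplesort_set_bound() does today):

    /*
     * Hypothetical interface: tell tuplesort not to use abbreviated keys,
     * just as tuplesort_set_bound() already does for top-N heapsorts.
     */
    void
    tuplesort_disable_abbrev(Tuplesortstate *state)
    {
        /* fall back on the authoritative comparator for the leading key */
        state->sortKeys->abbrev_converter = NULL;
        if (state->sortKeys->abbrev_full_comparator)
            state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;
    }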

Other issues in executor
--------------------------------

This is sort of an optimizer issue, but the code lives in execAmi.c.
The Assert here is redundant:

+		case T_Sort:
+			/* We shouldn't reach here without having plan node */
+			Assert(node);
+			/* With skipCols sort node holds only last bucket */
+			if (node && ((Sort *)node)->skipCols == 0)
+				return true;
+			else
+				return false;
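
In other words, the runtime test and the assertion duplicate each
other; this could presumably just be something like:

        case T_Sort:
            /* With skipCols, the Sort node holds only the current bucket */
            return ((Sort *) node)->skipCols == 0;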

I also don't like that you've added a Plan node argument to
ExecMaterializesOutput() to make this work.
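
On master, that determination is made statically, from the node tag
alone (execAmi.c):

    bool
    ExecMaterializesOutput(NodeTag plantype)
    {
        switch (plantype)
        {
            case T_Material:
            case T_FunctionScan:
            case T_CteScan:
            case T_WorkTableScan:
            case T_Sort:
                return true;

            default:
                break;
        }

        return false;
    }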

There is similar assert/pointer test code within
ExecSupportsBackwardScan() and ExecSupportsMarkRestore(). In general,
I have concerns about the way the determination of a sort's ability to
do stuff like be scanned backwards is now made dynamic, which this new
code demonstrates:

+	/*
+	 * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+	 * or EXEC_FLAG_MARK, because we hold only current bucket in
+	 * tuplesortstate.
+	 */
+	Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+											 EXEC_FLAG_BACKWARD |
+											 EXEC_FLAG_MARK)) == 0);
+

I need to think some more about this general issue.

Misc. issues
----------------

_readSort() needs a READ_INT_FIELD() for the new skipCols field, and
_outSort() similarly needs a WRITE_INT_FIELD(). You've mostly missed
this stuff.
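
That is, _outSort() needs a line like the following next to the
existing numCols output (and _readSort() needs the READ_INT_FIELD()
equivalent):

    	WRITE_INT_FIELD(numCols);
    	WRITE_INT_FIELD(skipCols);		/* new field must round-trip too */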

Please be more careful about this. It's always a good idea to run the
regression tests with "#define COPY_PARSE_PLAN_TREES" from time to
time, which tends to highlight these problems.
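
(That means temporarily defining the symbol in
src/include/pg_config_manual.h, where it normally sits commented out,
and then running the tests:)

    /* in src/include/pg_config_manual.h */
    #define COPY_PARSE_PLAN_TREES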

tuplesort.h should not include sortsupport.h. It's a modularity
violation, and it's unnecessary besides. Similarly, pathkeys.c should
not include optimizer/cost.h.

What is this?

+	if (inner_cheapest_total && inner_cheapest_total->pathtype == T_Sort)
+		elog(ERROR, "Sort");

Optimizer
-------------

I am not an expert on the optimizer, but I do have some feedback.

* cost_sort() needs way, way more comments. It doesn't even mention
indexes. It's not worth commenting further until I know what it's
*supposed* to do, though.

* pathkeys_useful_for_ordering() now looks like a private convenience
wrapper for the new public function pathkeys_common(). I think that
comments should make this quite clear.

* compare_bifractional_path_costs() should live beside
compare_fractional_path_costs() and be public, I think (see the sketch
at the end of this list). The existing compare_fractional_path_costs()
also only has a small number of possible clients, and is still not
static.

* I think it's not okay that there are new arguments, such as the
"tuples" argument for get_cheapest_fractional_path_for_pathkeys().

It seems like a bad sign, design-wise, that a new "PlannerInfo *root"
argument was added at the end, for the narrow purpose of passing stuff
through to estimate the number of groups for the benefit of this
patch. ISTM that the grouping_planner() caller should do that work
itself, as and when it alone needs to.

* The new loop within get_cheapest_fractional_path_for_pathkeys()
requires far more explanation.

Please explain the theory behind the derivation of the
compare_bifractional_path_costs() fraction arguments. I think there
might be similar simple heuristics elsewhere in the optimizer or in
selfuncs.c, but you need to share why you did things that way in the
code.

* Within planner.c, the "partial_sort_path" variable in
grouping_planner() should probably be called something else; its
purpose isn't clear.

Also, the way that you mix path costs from the new
get_cheapest_fractional_path_for_pathkeys() into the new cost_sort()
needs to be explained in detail (as I already said, cost_sort() is
currently very under-documented).
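
On the compare_bifractional_path_costs() point above: I'd expect it to
read as the obvious generalization of the existing function, something
close to this (my sketch of the idea, not the patch's actual code):

    /*
     * Sketch: like compare_fractional_path_costs(), but with a separate
     * tuple fraction for each path, for when the two paths are expected
     * to stop after emitting different fractions of their output.
     */
    int
    compare_bifractional_path_costs(Path *path1, Path *path2,
                                    double fraction1, double fraction2)
    {
        Cost        cost1,
                    cost2;

        cost1 = path1->startup_cost +
            fraction1 * (path1->total_cost - path1->startup_cost);
        cost2 = path2->startup_cost +
            fraction2 * (path2->total_cost - path2->startup_cost);

        if (cost1 < cost2)
            return -1;
        if (cost1 > cost2)
            return +1;
        return 0;
    }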

Obviously the optimizer part of this needs the most work -- no
surprises there. I wonder whether we cover all useful cases. I haven't
yet got around to using "#define OPTIMIZER_DEBUG" to see what's really
going on, which seems essential to a full understanding, but it looks
like merge append paths can currently use the optimization, whereas
unique paths cannot. Have you thought about that?

That's all I have for now...

--
Peter Geoghegan
