Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From: James Coleman <jtc331(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Shaun Thomas <shaun(dot)thomas(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date: 2019-07-20 23:37:08
Message-ID: CAAaqYe9szJ9=a-ayZUuLbV8rBh3YYgxtCHzdh5Mo2xMS7NL=rQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jul 20, 2019 at 1:00 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Sat, Jul 20, 2019 at 12:21:01PM -0400, James Coleman wrote:
> >On Sat, Jul 20, 2019 at 12:12 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
> >>
> >> On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
> >> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >> ...
> >> > >My current line of investigation is whether we need to do anything in
> >> > >the parallel portion of create_ordered_paths(). I noticed that the
> >> > >first-pass patch adding generate_useful_gather_paths() modified that
> >> > >section but wasn't actually adding any new gather-merge paths (just
> >> > >bare incremental sort paths). That seems pretty clearly just a
> >> > >prototype miss, so I modified the prototype to build gather-merge
> >> > >paths instead (as a side note that change seems to fix an oddity I was
> >> > >seeing where plans would include a parallel index scan node even
> >> > >though they weren't parallel plans). While the resulting plan for
> >> > >something like:
> >> > >
> >> >
> >> > Yes, that seems to be a bug. The block above it clealy has a gather
> >> > merge nodes, so this one should too.
> >> >
> >> > >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
> >> > >t.a, t.b limit 50;
> >> > >
> >> > >changes cost (to be cheaper) ever so slightly with the gather-merge
> >> > >addition to create_ordered_paths(), the plan itself is otherwise
> >> > >identical (including row estimates):
> >> > >
> >> > >Limit
> >> > > -> Gather Merge
> >> > > -> Incremental Sort
> >> > > -> Parallel Index Scan
> >> > >
> >> > >(Note: I'm forcing parallel plans here with: set
> >> > >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
> >> > >set parallel_tuple_cost=0; set parallel_setup_cost=0; set
> >> > >min_parallel_index_scan_size=0;)
> >> > >
> >> > >I can't seem to come up with a case where adding these gather-merge
> >> > >paths in create_ordered_paths() isn't entirely duplicative of paths
> >> > >already created by generate_useful_gather_paths() as called from
> >> > >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
> >> > >that both apply_scanjoin_target_to_paths() and create_ordered_paths()
> >> > >are called by grouping_planner().
> >> > >
> >> > >Can you think of a case I'm missing here that would make it valuable
> >> > >to generate new parallel plans in create_ordered_paths()?
> >> > >
> >> >
> >> > Good question. Not sure. I think such path would need to do something on
> >> > a relation that is neither a join nor a scan - in which case the path
> >> > should not be created by apply_scanjoin_target_to_paths().
> >> >
> >> > So for example a query like this:
> >> >
> >> > SELECT
> >> > a, b, sum(expensive_function(c))
> >> > FROM
> >> > t
> >> > GROUP BY a,b
> >> > ORDER BY a,sum(...) LIMIT 10;
> >> >
> >> > should be able to produce a plan like this:
> >> >
> >> > -> Limit
> >> > -> Gather Merge
> >> > -> Incremental Sort (pathkeys: a, sum)
> >> > -> Group Aggregate
> >> > a, b, sum(expensive_function(c))
> >> > -> Index Scan (pathkeys: a, b)
> >> >
> >> > or something like that, maybe. I haven't tried this, though. The other
> >> > question is whether such queries are useful in practice ...
> >>
> >> Hmm, when I step through on that example input_rel->partial_pathlist
> >> != NIL is false, so we don't even attempt to consider any extra
> >> parallel paths in create_ordered_paths(). Nonetheless we get a
> >> parallel plan, but with a different shape:
> >>
> >> Limit
> >> -> Incremental Sort
> >> Sort Key: a, (sum(expensive_function(c)))
> >> Presorted Key: a
> >> -> Finalize GroupAggregate
> >> Group Key: a, b
> >> -> Gather Merge
> >> Workers Planned: 4
> >> -> Partial GroupAggregate
> >> Group Key: a, b
> >> -> Sort
> >> Sort Key: a, b
> >> -> Parallel Seq Scan on t
> >>
> >> (or if I disable seqscan then the sort+seq scan above becomes inc sort
> >> + index scan)
> >>
> >> To be honest, I don't think I understand how you would get a plan like
> >> that anyway since the index here only has "a" and so can't provide
> >> (pathkeys: a, b).
> >>
> >> Perhaps there's something I'm still missing though.
>
> I wasn't stricly adhering to the example we used before, and I imagined
> there would be an index on (a,b). Sorry if that wasn't clear.
>
> >
> >Also just realized I don't think (?) we can order by the sum inside a
> >gather-merge -- at least not without having another sort above the
> >parallel portion? Or is the group aggregate able to also provide
> >ordering on the final sum after aggregating the partial sums?
> >
>
> Yes, you're right - an extra sort node would be necessary. But I don't
> think that changes the idea behind that example. The question is whether
> the extra sorts below the gather would be cheaper than doing a large sort
> on top of it - but I don't see why wouldn't that be the case, and if we
> only need a couple of rows from the beginning ...

Ah, I see. Right now we get:

Limit
-> Incremental Sort
Sort Key: a, (sum(expensive_function(c)))
Presorted Key: a
-> Finalize GroupAggregate
Group Key: a, b
-> Gather Merge
Workers Planned: 2
-> Partial GroupAggregate
Group Key: a, b
-> Parallel Index Scan using t_a_b on t

even with the parallel additions to create_ordered_paths() -- that
addition doesn't actually add any new parallel paths because
input_rel->partial_pathlist != NIL is false (I'm not sure why yet), so
if we want (if I understand correctly) something more like:

I'm still struggling to understand though how another incremental sort
below the gather-merge would actually be able to help us. For one I'm
not sure it would be less expensive, but more importantly I'm not sure
how we could do that and maintain correctness. Wouldn't a per-worker
sum not actually be useful in sorting since it has no predictable
impact on the ordering of the total sum?

Describing that got me thinking of similar cases where ordering of the
partial aggregate would (in theory) be a correct partial sort for the
total ordering, and it seems like min() and max() would be. So I ran
the same experiment with that instead of sum(), but, you guessed it,
input_rel->partial_pathlist != NIL is false again, so we don't add any
parallel paths in create_ordered_paths().

I'm leaning towards thinking considering parallel incremental sorts in
create_ordered_paths() won't add value. But I also feel like this
whole project has me jumping into the deep end of the pool again (this
time the planner), so I'm still picking up a lot of pieces for how all
this fits together, and as such I don't have a great intuitive grasp
yet of how this particular part of the planning process maps to the
kind of queries and plans we consider. All that to say: if you have
further thoughts, I'm happy to look into it, but right now I'm not
seeing anything.

James Coleman

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Corey Huinker 2019-07-21 01:10:33 Re: Catching missing Datum conversions
Previous Message Alexander Korotkov 2019-07-20 23:07:44 Re: [PATCH v2] Introduce spgist quadtree @<(point,circle) operator