Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Shaun Thomas <shaun(dot)thomas(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date: 2019-07-20 16:59:51
Message-ID: 20190720165951.m6adwbqi26bsfbjb@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jul 20, 2019 at 12:21:01PM -0400, James Coleman wrote:
>On Sat, Jul 20, 2019 at 12:12 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
>>
>> On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> ...
>> > >My current line of investigation is whether we need to do anything in
>> > >the parallel portion of create_ordered_paths(). I noticed that the
>> > >first-pass patch adding generate_useful_gather_paths() modified that
>> > >section but wasn't actually adding any new gather-merge paths (just
>> > >bare incremental sort paths). That seems pretty clearly just a
>> > >prototype miss, so I modified the prototype to build gather-merge
>> > >paths instead (as a side note that change seems to fix an oddity I was
>> > >seeing where plans would include a parallel index scan node even
>> > >though they weren't parallel plans). While the resulting plan for
>> > >something like:
>> > >
>> >
>> > Yes, that seems to be a bug. The block above it clealy has a gather
>> > merge nodes, so this one should too.
>> >
>> > >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
>> > >t.a, t.b limit 50;
>> > >
>> > >changes cost (to be cheaper) ever so slightly with the gather-merge
>> > >addition to create_ordered_paths(), the plan itself is otherwise
>> > >identical (including row estimates):
>> > >
>> > >Limit
>> > > -> Gather Merge
>> > > -> Incremental Sort
>> > > -> Parallel Index Scan
>> > >
>> > >(Note: I'm forcing parallel plans here with: set
>> > >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
>> > >set parallel_tuple_cost=0; set parallel_setup_cost=0; set
>> > >min_parallel_index_scan_size=0;)
>> > >
>> > >I can't seem to come up with a case where adding these gather-merge
>> > >paths in create_ordered_paths() isn't entirely duplicative of paths
>> > >already created by generate_useful_gather_paths() as called from
>> > >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
>> > >that both apply_scanjoin_target_to_paths() and create_ordered_paths()
>> > >are called by grouping_planner().
>> > >
>> > >Can you think of a case I'm missing here that would make it valuable
>> > >to generate new parallel plans in create_ordered_paths()?
>> > >
>> >
>> > Good question. Not sure. I think such path would need to do something on
>> > a relation that is neither a join nor a scan - in which case the path
>> > should not be created by apply_scanjoin_target_to_paths().
>> >
>> > So for example a query like this:
>> >
>> > SELECT
>> > a, b, sum(expensive_function(c))
>> > FROM
>> > t
>> > GROUP BY a,b
>> > ORDER BY a,sum(...) LIMIT 10;
>> >
>> > should be able to produce a plan like this:
>> >
>> > -> Limit
>> > -> Gather Merge
>> > -> Incremental Sort (pathkeys: a, sum)
>> > -> Group Aggregate
>> > a, b, sum(expensive_function(c))
>> > -> Index Scan (pathkeys: a, b)
>> >
>> > or something like that, maybe. I haven't tried this, though. The other
>> > question is whether such queries are useful in practice ...
>>
>> Hmm, when I step through on that example input_rel->partial_pathlist
>> != NIL is false, so we don't even attempt to consider any extra
>> parallel paths in create_ordered_paths(). Nonetheless we get a
>> parallel plan, but with a different shape:
>>
>> Limit
>> -> Incremental Sort
>> Sort Key: a, (sum(expensive_function(c)))
>> Presorted Key: a
>> -> Finalize GroupAggregate
>> Group Key: a, b
>> -> Gather Merge
>> Workers Planned: 4
>> -> Partial GroupAggregate
>> Group Key: a, b
>> -> Sort
>> Sort Key: a, b
>> -> Parallel Seq Scan on t
>>
>> (or if I disable seqscan then the sort+seq scan above becomes inc sort
>> + index scan)
>>
>> To be honest, I don't think I understand how you would get a plan like
>> that anyway since the index here only has "a" and so can't provide
>> (pathkeys: a, b).
>>
>> Perhaps there's something I'm still missing though.

I wasn't stricly adhering to the example we used before, and I imagined
there would be an index on (a,b). Sorry if that wasn't clear.

>
>Also just realized I don't think (?) we can order by the sum inside a
>gather-merge -- at least not without having another sort above the
>parallel portion? Or is the group aggregate able to also provide
>ordering on the final sum after aggregating the partial sums?
>

Yes, you're right - an extra sort node would be necessary. But I don't
think that changes the idea behind that example. The question is whether
the extra sorts below the gather would be cheaper than doing a large sort
on top of it - but I don't see why wouldn't that be the case, and if we
only need a couple of rows from the beginning ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2019-07-20 17:30:30 Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Previous Message Sergei Kornilov 2019-07-20 16:44:53 Re: [HACKERS] Block level parallel vacuum