Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Shaun Thomas <shaun(dot)thomas(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date: 2020-03-29 03:14:37
Message-ID: 20200329031437.yvhogo25wdijwpqd@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Mar 28, 2020 at 10:47:49PM -0400, James Coleman wrote:
>On Sat, Mar 28, 2020 at 6:59 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> Hi,
>>
>> Attached is my take on simplification of the useful pathkeyes thing. It
>> keeps the function, but it truncates query_pathkeys to only members with
>> EC members in the relation. I think that's essentially the optimization
>> you've proposed.
>
>Thanks. I've included that in the patch series in this email (as a
>separate patch) with a few additional comments.
>

Thanks.

>I've also noticed that the enabled_incrementalsort check in
>generate_useful_gather_paths seemed broken, because it returned us out
>of the function before creating either a plain gather merge (if
>already sorted) or an explicit sort path. I've included a patch that
>moves it to the if block that actually builds the incremental sort
>path.
>

Hmmm, that's probably right.

The reason why the GUC check was right after generate_gather_paths is
that the intent to disable all the useful-pathkeys business, essentially
reducing it back to plain generate_gather_paths.

But I think you're right that's wrong, because it might lead to strange
behavior when the GUC switches between plans without any incremental
sort nodes - setting it to 'true' might end up picking GM on top of
plain Sort, for example.

>>
>> ...
>>
>> 1) Missing worker identification (Worker #).
>
>Fixed.
>
>> 2) Missing method for workers (we have it for the leader, though).
>
>Fixed. Since we can't have pointers in the parallel shared memory
>space, we can't store the sort methods used in a list. To accomplish
>the same goal, I've assigned the TuplesortMethod enum entries uique
>bit positions, and store the methods used in a bitmask.
>

OK, makes sense.

>> 3) I'm not sure why the lable is "Methods" instead of "Sort Method", and
>> why it's in parenthesis.
>
>I've removed the parentheses. It's labeled "Methods" since there can
>be more than one (different batches could use different methods). I've
>updated this to properly use singular/plural depending on the number
>of methods used.
>

I'm a bit confused. How do I know which batch used which method? Is it
actually worth printing in explain analyze? Maybe only print it in the
verbose mode?

>> 4) Not sure having two lines for each worker is a great idea.
>
>I've left these in for now because the lines are already very long
>(much, much longer than the worker lines in a standard sort node).
>This is largely because we're trying to summarize many sort batches,
>while standard sort nodes only have to give the exact stats from a
>single batch.
>
>See the example output later in the email.
>

OK

>> 5) I'd probably prefer having multiple labels for avg/max memory values,
>> instead of (avg) and (max) notes. Also, I think we use "peak" in this
>> context instead of "max".
>
>Updated.
>

OK

>Here's the current output:
>
> Limit (cost=1887419.20..1889547.68 rows=10000 width=8) (actual
>time=13218.403..13222.519 rows=10000 loo
>ps=1)
> -> Gather Merge (cost=1887419.20..19624748.03 rows=83333360
>width=8) (actual time=13218.401..13229.7
>50 rows=10000 loops=1)
> Workers Planned: 2
> Workers Launched: 2
> -> Incremental Sort (cost=1886419.17..10005010.55
>rows=41666680 width=8) (actual time=13208.00
>4..13208.586 rows=4425 loops=3)
> Sort Key: a, b
> Presorted Key: a
> Full-sort Groups: 1 Sort Method: quicksort Memory:
>avg=28kB peak=28kB
> Presorted Groups: 1 Sort Method: top-N heapsort Memory:
>avg=1681kB peak=1681kB
> Worker 0: Full-sort Groups: 1 Sort Method: quicksort
>Memory: avg=28kB peak=28kB
> Presorted Groups: 1 Sort Method: top-N heapsort
>Memory: avg=1680kB peak=1680kB
> Worker 1: Full-sort Groups: 1 Sort Method: quicksort
>Memory: avg=28kB peak=28kB
> Presorted Groups: 1 Sort Method: top-N heapsort
>Memory: avg=1682kB peak=1682kB
> -> Parallel Index Scan using index_s_a on s
>(cost=0.57..4967182.06 rows=41666680 width=8
>) (actual time=0.455..11730.878 rows=6666668 loops=3)
>
>James

Looks reasonable. Did you try it in other output formats - json/yaml?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message James Coleman 2020-03-29 03:23:32 Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Previous Message James Coleman 2020-03-29 02:47:49 Re: [PATCH] Incremental sort (was: PoC: Partial sort)