Re: Add proper planner support for ORDER BY / DISTINCT aggregates

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Pavel Luzanov <p(dot)luzanov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Subject: Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Date: 2022-12-13 07:53:40
Message-ID: CAApHDvr1Sm+g9hbv4REOVuvQKeDWXcKUAhmbK5K+dfun0s9CvA@mail.gmail.com
Lists: pgsql-hackers

On Wed, 9 Nov 2022 at 14:58, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> v2 attached.

I've been looking at this again and this time around I understand why
the * 1.5 pessimism factor was included in the incremental sort code.

If we create a table with a very large skew in the number of rows per
value in what will be our pre-sorted groups:

create table ab (a int not null, b int not null);
insert into ab select 0,x from generate_series(1,999000) x union all
select x%1000+1,0 from generate_series(999001,1000000) x;
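
The plans below reference ab_a_idx, so presumably an index on ab(a)
(and an ANALYZE) is also part of the setup, something like:

create index ab_a_idx on ab (a);
analyze ab;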

Here the 0 group has close to 1 million rows, but the remaining groups
1-1000 have just 1 row each. The planner only knows there are about
1001 distinct values in "a" and assumes an even distribution of rows
between those values.
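
One way to see what the planner is working from here is to look at the
pg_stats view for column "a" (after the ANALYZE above):

select attname, n_distinct from pg_stats
where tablename = 'ab' and attname = 'a';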

With:
explain (analyze, timing off) select * from ab order by a,b;

In master, the plan is:

QUERY PLAN
--------------------------------------------------------------------------------
 Sort  (cost=122490.27..124990.27 rows=1000000 width=8) (actual rows=1000000 loops=1)
   Sort Key: a, b
   Sort Method: quicksort  Memory: 55827kB
   ->  Index Scan using ab_a_idx on ab  (cost=0.42..22832.42 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.069 ms
 Execution Time: 155.469 ms

With the v2 patch it's:

QUERY PLAN
--------------------------------------------------------------------------------
 Incremental Sort  (cost=2767.38..109344.55 rows=1000000 width=8) (actual rows=1000000 loops=1)
   Sort Key: a, b
   Presorted Key: a
   Full-sort Groups: 33  Sort Method: quicksort  Average Memory: 27kB  Peak Memory: 27kB
   Pre-sorted Groups: 1  Sort Method: quicksort  Average Memory: 55795kB  Peak Memory: 55795kB
   ->  Index Scan using ab_a_idx on ab  (cost=0.42..22832.42 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.072 ms
 Execution Time: 163.614 ms

So there is a performance regression.

Sometimes teaching the planner new tricks means that it might use
those tricks at a bad time. Normally we put in an off switch for
these situations to allow users an escape hatch. We have
enable_incremental_sort for this. Incremental sort has tried to avoid
this problem by always considering the same "Sort" paths that we did
prior to incremental sort, and only additionally considering an
incremental sort for pre-sorted paths with the 1.5 pessimism factor
applied. The v2 patch takes away that safety net.

I think what we need to do is do our best to give incremental sort
the most realistic costs we can, and accept that it might choose a
worse plan in some cases. Users can turn it off if they really have no
other means to convince the planner it's wrong.
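
For the ORDER BY query above, the existing escape hatch looks like
this:

set enable_incremental_sort = off;
explain (analyze, timing off) select * from ab order by a,b;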

Additionally, I think we also need to add a GUC such as
enable_presorted_aggregate. People can use that when their Index Scan
-> Incremental Sort -> Aggregate plan is worse than the Seq
Scan -> Sort -> Aggregate plan that they were getting in < 16.
Turning off enable_incremental_sort alone won't give them the same
aggregate plan that they had in pg15, as we always set the
query_pathkeys to request a sort order that will suit the order by /
distinct aggregates.
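
For reference, the kind of query this affects is one with an ORDER BY
aggregate; assuming the new GUC ends up as a simple boolean, people
could do something like:

set enable_presorted_aggregate = off;
explain select a, array_agg(b order by b) from ab group by a;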

I'll draft up a patch for the enable_presorted_aggregate.

David
