Aggregate node doesn't include cost for sorting

From: David Geier <geidav(dot)pg(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Aggregate node doesn't include cost for sorting
Date: 2022-12-08 09:06:22
Message-ID: 2691f881-d8de-dd57-5793-381d7d13e3cc@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

The cost of an Aggregate node seems to be the same regardless of the
input being pre-sorted or not. If however the input is not sorted, the
Aggregate node must additionally perform a sort which can impact runtime
significantly. Here is an example:

CREATE TABLE foo(col0 INT, col1 TEXT);
INSERT INTO foo SELECT generate_series(1, 100000),
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' || md5(random()::TEXT);
CREATE INDEX foo_idx ON foo(col1);
SET max_parallel_workers_per_gather = 0;
SET enable_bitmapscan = FALSE;

-- Unsorted input. Aggregate node must additionally sort all rows.
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2584.00..2584.01 rows=1 width=8) (actual
time=1684.705..1684.809 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..2334.00 rows=100000 width=71)
(actual time=0.018..343.280 rows=100000 loops=1)
 Planning Time: 0.317 ms
 Execution Time: 1685.543 ms

-- Pre-sorted input. Aggregate node doesn't have to sort all rows.
SET enable_seqscan = FALSE;
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6756.42..6756.43 rows=1 width=8) (actual
time=819.028..819.041 rows=1 loops=1)
   ->  Index Only Scan using foo_idx on foo (cost=6506.42..6506.42
rows=100000 width=71) (actual time=0.046..404.260 rows=100000 loops=1)
         Heap Fetches: 100000
         Heap Prefetches: 1334
 Planning Time: 0.438 ms
 Execution Time: 819.515 ms

The cost of the Aggregate node is in both cases the same (250.0) while
its runtime is 1.3s in the unsorted case and 0.4s in the pre-sorted case.

Also, why does the Aggregate node sort itself? Why don't we instead emit
an explicit Sort node in the plan so that it's visible to the user what
is happening? As soon as there's also a GROUP BY in the query, a Sort
node occurs in the plan. This seems inconsistent.

--
David Geier
(ServiceNow)

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2022-12-08 09:12:45 Re: on placeholder entries in view rule action query's range table
Previous Message Masahiko Sawada 2022-12-08 08:43:14 Re: Perform streaming logical transactions by background workers and parallel apply