From: | David Geier <geidav(dot)pg(at)gmail(dot)com> |
---|---|
To: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Aggregate node doesn't include cost for sorting |
Date: | 2022-12-08 09:06:22 |
Message-ID: | 2691f881-d8de-dd57-5793-381d7d13e3cc@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi hackers,
The cost of an Aggregate node seems to be the same regardless of the
input being pre-sorted or not. If however the input is not sorted, the
Aggregate node must additionally perform a sort which can impact runtime
significantly. Here is an example:
CREATE TABLE foo(col0 INT, col1 TEXT);
INSERT INTO foo SELECT generate_series(1, 100000),
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' || md5(random()::TEXT);
CREATE INDEX foo_idx ON foo(col1);
SET max_parallel_workers_per_gather = 0;
SET enable_bitmapscan = FALSE;
-- Unsorted input. Aggregate node must additionally sort all rows.
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2584.00..2584.01 rows=1 width=8) (actual
time=1684.705..1684.809 rows=1 loops=1)
-> Seq Scan on foo (cost=0.00..2334.00 rows=100000 width=71)
(actual time=0.018..343.280 rows=100000 loops=1)
Planning Time: 0.317 ms
Execution Time: 1685.543 ms
-- Pre-sorted input. Aggregate node doesn't have to sort all rows.
SET enable_seqscan = FALSE;
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=6756.42..6756.43 rows=1 width=8) (actual
time=819.028..819.041 rows=1 loops=1)
-> Index Only Scan using foo_idx on foo (cost=6506.42..6506.42
rows=100000 width=71) (actual time=0.046..404.260 rows=100000 loops=1)
Heap Fetches: 100000
Heap Prefetches: 1334
Planning Time: 0.438 ms
Execution Time: 819.515 ms
The cost of the Aggregate node is in both cases the same (250.0) while
its runtime is 1.3s in the unsorted case and 0.4s in the pre-sorted case.
Also, why does the Aggregate node sort itself? Why don't we instead emit
an explicit Sort node in the plan so that it's visible to the user what
is happening? As soon as there's also a GROUP BY in the query, a Sort
node occurs in the plan. This seems inconsistent.
--
David Geier
(ServiceNow)
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2022-12-08 09:12:45 | Re: on placeholder entries in view rule action query's range table |
Previous Message | Masahiko Sawada | 2022-12-08 08:43:14 | Re: Perform streaming logical transactions by background workers and parallel apply |