Significant performance issues with array_agg() + HashAggregate plans on Postgres 17

From: Scott Carey <scott(dot)carey(at)algonomy(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17
Date: 2026-03-31 09:28:59
Message-ID: CA+vubOGaanYaGGb98etJPeJCbN=RAawx652MdmJVRK8+BeEP0w@mail.gmail.com
Lists: pgsql-performance

Problem Summary:

A simple aggregate using array_agg() gets significantly faster the
*smaller* (work_mem * hash_mem_multiplier) is, with the same simple query
plan: HashAggregate over a sequential scan. Switching to a simple
aggregate such as max() does not show this behavior and is always fast.
Switching to another aggregate that grows in size for each element, such as
json_agg or string_agg, also does not show this behavior. If I add an
ORDER BY clause inside array_agg, performance significantly improves as
the plan changes from a HashAggregate over a sequential scan to a
GroupAggregate over a sort over a sequential scan. Something seems
specifically broken with array_agg + HashAggregate.

These queries are anywhere from 10x to 1000x slower on Postgres 17.9 than
they were on Postgres 12.19 on production data. Some of our OLTP queries
have gone from minutes to 6 hours to complete. I do not know if this
happens on Postgres 18, but I can confirm it also happens on Postgres
16.8. I do not know about 13 through 15.

Below is a simplified reproduction with a test table:

show server_version;
server_version
----------------
17.9

create table array_agg_test(product_id bigint not null, region_id bigint
not null, available boolean not null);

insert into array_agg_test (product_id, region_id, available) SELECT
generate_series(1, 50000) as product_id,
(ARRAY[1,2,3,4])[floor(random()*4)+1] as region_id, true as available;
insert into array_agg_test (product_id, region_id, available) SELECT
generate_series(1, 50000) as product_id,
(ARRAY[11,12,13,14])[floor(random()*4)+1] as region_id, true as available;

insert into array_agg_test (product_id, region_id, available) SELECT
generate_series(1, 50000) as product_id,
(ARRAY[111,112,113,114])[floor(random()*4)+1] as region_id, true as
available;
insert into array_agg_test (product_id, region_id, available) SELECT
generate_series(1, 50000) as product_id,
(ARRAY[1111,1112,1113,1114])[floor(random()*4)+1] as region_id, true as
available;
vacuum analyze array_agg_test;

We now have a table with 200000 rows: 50000 distinct product_id values with
4 rows each, each row with a distinct region_id. It is simple enough that
default statistics are fine here; we want to trigger HashAggregate over
SeqScan anyway. Other query plans that avoid HashAggregate don't have the
issue -- index scans and group aggregates are fine.
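
A quick sanity check of the generated data (a verification query I've added
for illustration; the expected counts follow from the inserts above):

select count(*) as total_rows,
       count(distinct product_id) as distinct_products
from array_agg_test;
-- should report 200000 total rows and 50000 distinct product_id values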

set hash_mem_multiplier = 2;
set work_mem = '100MB';

explain (analyze, buffers) select product_id, array_agg(region_id) from
array_agg_test group by product_id;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=4274.00..4771.49 rows=49749 width=40) (actual
time=4628.278..4643.765 rows=50000 loops=1)
Group Key: product_id
Batches: 1 Memory Usage: 55649kB
Buffers: shared hit=1274
-> Seq Scan on array_agg_test (cost=0.00..3274.00 rows=200000 width=16)
(actual time=0.030..16.694 rows=200000 loops=1)
Buffers: shared hit=1274
Planning Time: 0.067 ms
Execution Time: 4648.698 ms

set work_mem = '20MB';

explain (analyze, buffers) select product_id, array_agg(region_id) from
array_agg_test group by product_id;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=16086.50..18537.11 rows=49749 width=40) (actual
time=2568.837..2672.140 rows=50000 loops=1)
Group Key: product_id
Planned Partitions: 4 Batches: 5 Memory Usage: 40954kB Disk Usage:
3352kB
Buffers: shared hit=1274, temp read=243 written=594
-> Seq Scan on array_agg_test (cost=0.00..3274.00 rows=200000 width=16)
(actual time=0.013..15.266 rows=200000 loops=1)
Buffers: shared hit=1274
Planning Time: 0.051 ms
Execution Time: 2674.329 ms

set work_mem = '10MB';

explain (analyze, buffers) select product_id, array_agg(region_id) from
array_agg_test group by product_id;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=16086.50..18537.11 rows=49749 width=40) (actual
time=635.816..888.167 rows=50000 loops=1)
Group Key: product_id
Planned Partitions: 8 Batches: 9 Memory Usage: 20474kB Disk Usage:
7272kB
Buffers: shared hit=1274, temp read=566 written=1388
-> Seq Scan on array_agg_test (cost=0.00..3274.00 rows=200000 width=16)
(actual time=0.018..12.689 rows=200000 loops=1)
Buffers: shared hit=1274
Planning Time: 0.057 ms
Execution Time: 890.987 ms

set work_mem = '5MB';

explain (analyze, buffers) select product_id, array_agg(region_id) from
array_agg_test group by product_id;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=16086.50..18537.11 rows=49749 width=40) (actual
time=172.948..341.847 rows=50000 loops=1)
Group Key: product_id
Planned Partitions: 16 Batches: 17 Memory Usage: 10234kB Disk Usage:
7080kB
Buffers: shared hit=1274, temp read=731 written=1553
-> Seq Scan on array_agg_test (cost=0.00..3274.00 rows=200000 width=16)
(actual time=0.010..11.715 rows=200000 loops=1)
Buffers: shared hit=1274
Planning Time: 0.064 ms
Execution Time: 344.248 ms

set work_mem = '1MB';

explain (analyze, buffers) select product_id, array_agg(region_id) from
array_agg_test group by product_id;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=16086.50..18537.11 rows=49749 width=40) (actual
time=56.102..144.350 rows=50000 loops=1)
Group Key: product_id
Planned Partitions: 64 Batches: 65 Memory Usage: 2050kB Disk Usage:
12200kB
Buffers: shared hit=1274, temp read=892 written=2374
-> Seq Scan on array_agg_test (cost=0.00..3274.00 rows=200000 width=16)
(actual time=0.017..12.346 rows=200000 loops=1)
Buffers: shared hit=1274
Planning Time: 0.053 ms
Execution Time: 147.254 ms

Below this work_mem size it chooses a GroupAggregate over a sorted scan.

set hash_mem_multiplier = 20;

explain (analyze, buffers) select product_id, array_agg(region_id) from
array_agg_test group by product_id;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=16086.50..18537.11 rows=49749 width=40) (actual
time=654.729..890.816 rows=50000 loops=1)
Group Key: product_id
Planned Partitions: 8 Batches: 9 Memory Usage: 20480kB Disk Usage:
7264kB
Buffers: shared read=1274, temp read=561 written=1384
-> Seq Scan on array_agg_test (cost=0.00..3274.00 rows=200000 width=16)
(actual time=0.044..18.521 rows=200000 loops=1)
Buffers: shared read=1274
Planning Time: 0.067 ms
Execution Time: 893.208 ms

Note that performance is a function of (hash_mem_multiplier * work_mem), as
it seems to be related to the number of Batches. The more batches, the
faster it goes.

Below, note that json_agg does not have this problem:

set hash_mem_multiplier = 2;
set work_mem = '500MB';

explain (analyze, buffers) select product_id, json_agg(region_id) from
array_agg_test group by product_id;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=4274.00..4895.86 rows=49749 width=40) (actual
time=110.975..122.781 rows=50000 loops=1)
Group Key: product_id
Batches: 1 Memory Usage: 67089kB
Buffers: shared read=1274
-> Seq Scan on array_agg_test (cost=0.00..3274.00 rows=200000 width=16)
(actual time=0.034..17.659 rows=200000 loops=1)
Buffers: shared read=1274
Planning:
Buffers: shared hit=10
Planning Time: 0.054 ms
Execution Time: 124.929 ms

and adding a useless ORDER BY clause inside array_agg triggers a
GroupAggregate, which is also fast:

explain (analyze, buffers) select product_id, array_agg(region_id order by
available) from array_agg_test group by product_id;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=20883.64..22881.13 rows=49749 width=40) (actual
time=40.013..73.385 rows=50000 loops=1)
Group Key: product_id
Buffers: shared hit=5 read=1274
-> Sort (cost=20883.64..21383.64 rows=200000 width=17) (actual
time=40.000..46.090 rows=200000 loops=1)
Sort Key: product_id, available
Sort Method: quicksort Memory: 13957kB
Buffers: shared hit=5 read=1274
-> Seq Scan on array_agg_test (cost=0.00..3274.00 rows=200000
width=17) (actual time=0.033..17.911 rows=200000 loops=1)
Buffers: shared read=1274
Planning:
Buffers: shared hit=7
Planning Time: 0.063 ms
Execution Time: 74.823 ms

The "missing time" here is between the end of the sequential scan, which
takes < 20ms, and the 'start' of the HashAggregate, which in the worst case
example here is several seconds later.

I am fairly stuck here. I am looking at modifying client code to use
json_agg instead of array_agg where possible as a work-around, but ideally
that would not be needed; array_agg shouldn't behave significantly
differently.
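
Another session-level workaround I am considering is steering the planner
away from HashAggregate entirely (a sketch using the standard
enable_hashagg planner setting; note it affects every query in the
transaction, not just this one):

begin;
set local enable_hashagg = off;  -- force GroupAggregate + Sort for this transaction only
select product_id, array_agg(region_id)
from array_agg_test
group by product_id;
commit;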

A secondary observation, related but not the issue at hand:
The row size estimate for the aggregate is always `width=40` here, no
matter how large the resulting arrays are expected to be. In extreme cases
this can lead to hash memory consumption far larger than predicted. On
Postgres 12, several years ago, I once saw a query with work_mem of 1000MB
use up 290GB and crash the server: it was running a complex json_agg
across a large number of values per bucket, and the query planner did not
expect to store json data that large per output row of the aggregate.
Disk-backed aggregates now prevent the crash, but it would probably help
the query planner if array_agg (and other accumulating aggregates like
json_agg) could provide an output size estimate that is a function of the
expected number of elements aggregated over.

Thanks in advance for any help here!

Scott Carey
