Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17

From: Scott Carey <scott(dot)carey(at)algonomy(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17
Date: 2026-04-01 07:04:49
Message-ID: CA+vubOEbfm5+BxtuFGb_tTj3sSX1u0Y+=zHUxoF1Q9feGpRfNg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, Mar 31, 2026 at 12:26 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Scott Carey <scott(dot)carey(at)algonomy(dot)com> writes:
> >> On Tue, Mar 31, 2026 at 5:03 AM David Rowley <dgrowleyml(at)gmail(dot)com>
> wrote:
> >>> I tried and failed to recreate this locally on 17.9. For me the
> >>> json_agg query is slower than array_agg(). I tried making the table
> >>> 10x bigger and still don't see the same issue. The one with more
> >>> work_mem and fewer batches is always faster for me.
>
> >> I don't know what other differences there could be, other than OS. This
> >> reproduces for me on Linux with the above on a RHEL 9 clone (pg 17) or
> with
> >> Ubuntu 25.10 (pg 16) so I suspect it is not too picky about the distro
> used.
>
> Like David, I can't reproduce the described behavior. I tried on
> RHEL8/x86_64 and on macOS/M4, and got runtimes that barely vary
> across different work_mem settings, all sub-100ms. It should be
> noted that I tested v17 branch tip not precisely 17.9 --- but there's
> nothing in the commit log to suggest that we changed v17's behavior
> since February.
>
> One thing I find interesting is that your results show significantly
> more memory consumption as well as runtime. I had to add a run with
> work_mem = "200MB" to get the no-batching behavior you show at
> work_mem = "100MB", and then my results look like
>
>
The memory difference is strange.
I now have 6 systems that I have tested this on. One of them behaves just
like yours above, with the same memory usage and appropriate performance.
5 batches and 9009kB at work_mem = "100MB";

The other 5 all misbehave and have ~ 50x worse performance when there is
only one batch. (work_mem 1000MB). These use a little bit over 3x the
memory for the single batch.

> Some thoughts:
>
> * Does it repro without the "vector" extension? Seems unlikely that
> that is related, but we're at the grasping-at-straws stage.
>
>
Although I cannot remove the vector extension safely in production, I tried
adding the extension to the system that does not reproduce the problem, and
that did not trigger it.

> * More grasping at straws: is this stock community Postgres, or
> some vendor's modification (eg RDS or Aurora)?
>

This is from the pgdg repo for RHEL 9, from the postgresql.org website.

>
> * It would be worth doing the EXPLAINs with the SETTINGS option,
> just to make sure that there's not some non-default setting you
> forgot to mention.
>

I did not mention a few values that differ between the servers that
reproduce this, like autovacuum tuning parameters and
maintenance_work_men. adding settings to the explain gives a couple more,
unlikely to be related to the problem:

Settings: temp_buffers = '512MB', work_mem = '1000MB',
effective_io_concurrency = '16', effective_cache_size = '150GB'

While writing this, I decided to test out a few more vector extension test
cases, and discovered something new and mind boggling: :

On systems that reproduces the problem, if I create a new test database,
then test the query in that database, the problem does not occur.

e.g.

create database test;
\c test
.... run all the commands in my first email

In the 'test' database everything is fine. Queries are fast, memory use is
the same as yours. If I go back to the production database, the problem
occurs again.
One thing in common for the systems with the problem is that they had the
pgvector extension installed on them for a while, and have gone through
some pg_update cycles.

I have no idea where to go from here on identifying why one database would
behave like this but not the other -- on the same posrgres instance.
This _could_ still be the pgvector extension, or at least something to do
with using it through a pg_upgrade. The extension upgrade was executed
after pg_update, but maybe something is wrong with that for this extension.

Before discovering the above, my next plan was to set up linux perf and
capture some OS level profiling on one of the near idle read-only standbys
that show the problem, to see if there is something we can see there. I
haven't investigated how to get decent stacks from that, I assume
installing some debug packages from the repo would enable that but I have
done no research and haven't attempted to use `perf` on postgres before.

>
> regards, tom lane
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Scott Carey 2026-04-01 07:18:22 Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17
Previous Message David Rowley 2026-03-31 22:48:55 Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17