Re: Hash aggregate collisions cause excessive spilling

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hash aggregate collisions cause excessive spilling
Date: 2026-02-19 15:33:22
Message-ID: a5f46e2e-3b11-44bd-9a24-fa7d5b3948cd@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2/19/26 15:49, Ants Aasma wrote:
> Investigating a customer complaint I ran into an issue with the hash
> aggregate code. The problem was that a query that usually completes in
> less than a minute sometimes gets stuck indefinitely (hours+). I
> tracked it down to a hash aggregate node returning one tuple from a
> batch and spilling the rest.
>
> The reason for the behavior is that aggstate->hash_metacxt was 100M,
> which is larger than work_mem*hash_mem_multiplier of 64M. This makes
> hash_agg_check_limits() always spill after the first tuple. I think
> that ends up having a n² overhead, with n being almost 4M here.
>

Not great :-(

> I don't have a simple reproducer yet, because the live problem was on
> a parallel query where looking at the backend wrong caused the problem
> to disappear. After some retries I was able to catch an instance of
> growing past work_mem with gdb. After that growth the simplehash was
> {size = 4194304, members = 409839, ..}, i.e. the table was only 20%
> full before growing. So the cause seems to be a run of hash collisions
> bigger than SH_GROW_MAX_MOVE (150).
>
> AFAICT there is nothing in simplehash that would stop it growing past
> work_mem, and once it does the spilling logic in
> agg_refill_hash_table() enters this degenerate state until the end of
> the plan node.
>
> I think the correct fix would be to have a way to insert into
> simplehash with a limit on size, which means that the insert might
> fail. I haven't yet looked at how complicated this would be to
> implement.
>

Wouldn't it be easier to just start ignoring SH_GROW_MAX_MOVE? That'd
have a little bit of performance impact on that one key, but that seems
acceptable. And easier to do than dealing with failing inserts.

> I also haven't checked what is the cause for such a long run of
> collisions. But I think it's related to it being a HashAggregate on
> top of Gather on top of HashAggregate.
>

So it's a parallel aggregate? Partial + Finalize? I wonder if that might
be "correlating" the data in a way that makes it more likely to hit
SH_GROW_MAX_MOVE. But If that was the case, wouldn't we see this issue
more often?

regards

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Matheus Alcantara 2026-02-19 15:38:06 psql: Add verbose option (+) for \dX command
Previous Message Fujii Masao 2026-02-19 14:57:24 Re: [PATCH] Expose checkpoint reason to completion log messages.