Hash aggregate collisions cause excessive spilling

From: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Hash aggregate collisions cause excessive spilling
Date: 2026-02-19 14:49:44
Message-ID: CANwKhkPOZupu3PYQVdkMmYjquYVqG2v8XmCAuuVM9Eu13-Zw3g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Investigating a customer complaint I ran into an issue with the hash
aggregate code. The problem was that a query that usually completes in
less than a minute sometimes gets stuck indefinitely (hours+). I
tracked it down to a hash aggregate node returning one tuple from a
batch and spilling the rest.

The reason for the behavior is that aggstate->hash_metacxt was 100M,
which is larger than work_mem*hash_mem_multiplier of 64M. This makes
hash_agg_check_limits() always spill after the first tuple. I think
that ends up having a n² overhead, with n being almost 4M here.

I don't have a simple reproducer yet, because the live problem was on
a parallel query where looking at the backend wrong caused the problem
to disappear. After some retries I was able to catch an instance of
growing past work_mem with gdb. After that growth the simplehash was
{size = 4194304, members = 409839, ..}, i.e. the table was only 20%
full before growing. So the cause seems to be a run of hash collisions
bigger than SH_GROW_MAX_MOVE (150).

AFAICT there is nothing in simplehash that would stop it growing past
work_mem, and once it does the spilling logic in
agg_refill_hash_table() enters this degenerate state until the end of
the plan node.

I think the correct fix would be to have a way to insert into
simplehash with a limit on size, which means that the insert might
fail. I haven't yet looked at how complicated this would be to
implement.

I also haven't checked what is the cause for such a long run of
collisions. But I think it's related to it being a HashAggregate on
top of Gather on top of HashAggregate.

Regards,
Ants Aasma

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2026-02-19 14:57:24 Re: [PATCH] Expose checkpoint reason to completion log messages.
Previous Message Daniel Gustafsson 2026-02-19 14:22:52 Re: Make PGOAUTHCAFILE in libpq-oauth work out of debug mode