Re: Hash aggregate collisions cause excessive spilling

From: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hash aggregate collisions cause excessive spilling
Date: 2026-03-01 17:19:38
Message-ID: CANwKhkM5ZWH4WMegLZ7s_RBzkTtN7CUouNiu291Xn-76JZwd_w@mail.gmail.com
Lists: pgsql-hackers

On Sat, 28 Feb 2026 at 22:03, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > This problem should only affect cases where multiple hash tables worth
> > of tuples are trying to fit into one.
>
> Is that entirely true? If we are feeding one hash table from other hash table,
> 1:1, and the initial size of the target table is wrong, we probably also won't
> handle it in the best way.
>
> That could happen with a group by over a group by (with perhaps some joins
> above the first group by). Probably not that common, but ...
>
> I guess another case could be a CTAS of a group by query that is then grouped
> by again.

With 1:1 tables, if the target table is smaller, there are going to be a
ton of collisions, which should quite quickly trigger growth via
SH_GROW_MAX_MOVE (150). But once the target table is the same size as
the source, the growth will stop. I think this ends up being slightly
more efficient in the correlated case because the table will be growing
at a 10% fill factor, leaving fewer entries to move around. And access
to the target hash table will be more correlated over time, for better
cache efficiency.

However, I think it might be possible to have different hash table size
limits within a single plan. At least setting hash_mem_multiplier on a
function would do it. But I wasn't able to trigger any bad behavior
with this for some reason. I will experiment some more.

> > I think right now that limits it to hash aggregates with parallel
> > query. ExecBuildHash32FromAttrs comment says that we should be using
> > non-zero init_value only when needed for a marginal performance gain. So I
> > limited the plan node id inclusion to use_variable_hash_iv, but made it
> > unconditional for aggregates.
>
> That's an argument... I guess we could mostly avoid that though, by just
> always combining in EEOP_HASHDATUM_FIRST.

I think that makes it exactly the same as EEOP_HASHDATUM_NEXT32,
making the special op code pointless. I assume David observed some
speedup there, given the effort he went through to include it.

> > I used hash_combine to smear together the bits of the two. I think
> > that should be good enough. Alternatively a xor of murmurhash32 of the
> > two should be better.
>
> Probably it's good enough, but i'd probably still go for
> hash_combine(murmurhash32(ParallelWorkerNumber),
> murmurhash32(parent->plan->plan_node_id))
> or such.

Used this approach.

> > There is also a call to ExecBuildHash32FromAttrs from ExecInitSubPlan
> > that specifies a hash_iv for a hashed sub plan execution.
>
> When you're saying it specifies a hash IV you mean the 0 it passes, not some
> more complicated value, right?

Right.

>
> > This must match the value BuildTupleHashTable decides on. Worth adding a
> > comment?
>
> Yes, I think so.

Done.

> > */
> > if (use_variable_hash_iv)
> > - hash_iv = murmurhash32(ParallelWorkerNumber);
> > + hash_iv = hash_combine(ParallelWorkerNumber, parent->plan->plan_node_id);
> >
> > hashtable->hashtab = tuplehash_create(metacxt, nbuckets, hashtable);
>
> I think some of the reasoning from your commit messages needs to be in a
> comment somewhere too, otherwise the next person tackling this code will have
> a hell of a time.

I gave it a shot.

> > diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
> > index 7d487a165fa..5c947fd151f 100644
> > --- a/src/backend/executor/nodeAgg.c
> > +++ b/src/backend/executor/nodeAgg.c
> > @@ -1535,7 +1535,7 @@ build_hash_table(AggState *aggstate, int setno, double nbuckets)
> > metacxt,
> > tuplescxt,
> > tmpcxt,
> > - DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
> > + true);
> > }
>
> Hm, won't that mean we'll now always use the slower path? I think there are
> non-partial cases that could be problematic, but if we do this we probably
> should do the optimization of the hash expression stuff I mentioned above.

If you mean EEOP_HASHDATUM_FIRST, then I think that would just be
removing an optimization. I think getting rid of the quicker growth
and correlated table filling in the 1:1 case might be more significant
than the slightly slower hash calculation. However, the downside of the
bad case is pretty nasty. Do you think it's worth the effort of trying
to figure out whether we are in a dangerous-looking plan structure?

Decorrelating the hash tables still doesn't fix the terrible edge case
of a hash aggregate exceeding the memory limit with the hash table
alone; it just makes it exceedingly improbable to hit with
non-antagonistic workloads. Would it be a good idea to fix that too? A
simple answer would be to always allow the hash aggregate to use some
fraction, say 25%, of the allowed memory for tuples, even if the hash
table is too big.

Regards,
Ants Aasma

Attachment Content-Type Size
v2-0001-Decorrelate-nested-hash-tables.patch text/x-patch 3.9 KB
