Re: Adjusting hash join memory limit to handle batch explosion

From: Ben Mejia <benjamin(dot)arthur(dot)mejia(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2026-06-24 23:37:19
Message-ID: 9a3604ae-b36d-492d-a74b-a3f57cdfd5ae@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/31/24 3:06 PM, Tomas Vondra wrote:
> Hi,
>
> I've been once again reminded of the batch explosion issue in hashjoin,

...

>
> It's much worse when there's a batch that is not "divisible", i.e.
> adding more batches does not split it roughly in half. This can happen
> due to hash collisions (in the part that determines the batch),
> duplicate values that didn't make it into MCV (and thus the skew
> optimization does not kick in).
>
> This is fairly rare, but when it happens it can easily lead to batch
> explosion, i.e. rapidly increasing the number of batches. We add
> batches, but the batch does not split, so we promptly hit the limit
> again, triggering another increase. It often stops only when we exhaust
> the 32-bit hash space, ending with 100s of thousands of batches.

...

> * It actually helps with the "indivisible batch" case - it relaxes the
> limit, so there's a chance the batch eventually fits and we stop adding
> more and more batches. With spill files that's not the case - we still
> keep the original limit, and we end up with the batch explosion (but
> then we handle it much more efficiently).
The "indivisible batch" case is when all tuples end up in the same
batch. This is directly analogous to having all the tuples in one hash
bucket. In Postgres hash joins we're simply using different bits of the
tuple's hash value to select batch and bucket.

This can be addressed by the relax-the-limit approach Tomas describes in
patch (1) of <7bed6c08-72a0-4ab9-a79c-e01fcdd0940f(at)vondra(dot)me>, committed
for PG18 as a1b4f28.

But this can only do so much. In fact there's the comment in nodeHash.c:

/*
* If we dumped out either all or none of the tuples in the table, disable
* further expansion of nbatch. This situation implies that we have
* enough tuples of identical hashvalues to overflow spaceAllowed.
* Increasing nbatch will not fix it since there's no way to subdivide the
* group any more finely. We have to just gut it out and hope the server
* has enough RAM.
*/

I tried the "loop over the inner side" as did Melanie in
<CAAKRu_YsWm7gc_b2nBGWFPE6wuhdOLfc1LBZ786DUzaCPUDXCA(at)mail(dot)gmail(dot)com>.
I got correct results but the runtime was worse than the gut-it-out
approach.

There is another thing to try: a secondary hash. A secondary hash splits
the batch when the tuples have distinct keys that collide in the hash
(the common case). It can't split a batch that is one repeated key:
equal keys hash equally. There we still fall back to the gut-it-out
path. The point is that today we treat both cases identically, when only
the duplicate-key case is truly unsplittable.

(See Raghu Ramakrishnan and Johannes Gehrke. Database Management
Systems. 3rd edition. McGraw-Hill, 2003. ISBN 0-07-246563-8. Chapter 14,
"Evaluation of Relational Operators". Page 465)

See <CAALR2u8WUfX0d+xJuf_12sgDZ29zTf_pWtTb2ACZL9en+_4q_g(at)mail(dot)gmail(dot)com>
for more information.

Ben Mejia

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2026-06-24 23:43:28 Re: Small patch to improve safety of utf8_to_unicode().
Previous Message Masahiko Sawada 2026-06-24 23:29:22 Re: Support UUIDv6 in uuid_extract_timestamp()