[RFC] Secondary hash to split stuck hash-join batches

From: Ben Mejia <benjamin(dot)arthur(dot)mejia(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: tomas(dot)vondra(at)enterprisedb(dot)com, melanieplageman(at)gmail(dot)com
Subject: [RFC] Secondary hash to split stuck hash-join batches
Date: 2026-05-31 19:41:13
Message-ID: CAALR2u8WUfX0d+xJuf_12sgDZ29zTf_pWtTb2ACZL9en+_4q_g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Tomas clearly describes the underlying problem in [1]: stuck batches cannot
be split by batch doubling. This occurs when the tuples' hash values all
map to the same batch, similar to all items ending up in a single hash
bucket.

The HashJoin executor currently faces two unappealing fallback options:

a) Continually doubling nBatch in a futile attempt to split the
unsplittable batch.

b) Inflating spaceAllowed, which silently exceeds work_mem.

While PG18 caps the first fallback to prevent runaway batch explosions, it
attempts to manage the situation by incrementally increasing memory usage.
Consequently, work_mem is routinely violated by HashJoin, particularly when
encountering these unsplittable batches.

I have been experimenting with a rehash-based approach to reduce the memory
used by hash joins and would appreciate feedback on the direction before
polishing it for submission.

Sketch of Rehash Approach

- When ExecHashIncreaseNumBatches detects a stuck batch, call the new
function, ExecHashRehashBatch to attempt a rehash.

- Rehash attempts to reassign each tuple in the stuck batch to a sub-batch
using a secondary hash function.

- Sub-batches are handled in the same way as normal batches, and by the
same code.

- Sub-batches are split, like normal batches until the sub-batch fits into
spaceAllowed

- If the rehash cannot split the batch, it fails and the current
inflate-and-gut-it-out path is taken.

In my prototype, I use a GUC, enable_hashjoin_rehash to turn on the rehash
logic; otherwise the code works the same as before.

Additional Heuristics

Having the rehash available as a tool to control memory usage, I also
explored further modifications:

- Using planner estimates to detect skew vs. large tuples:
This can avoid doing a batch split and immediately attempt a rehash.

- Avoiding PG18's memory inflation:
Based on skew detection above, avoids inflating spaceAllowed and
attempts a rehash.

- Diminishing returns detection:
If batch splitting does less than a split-threshold attempt a rehash.

Some Preliminary Results

For the table:

CREATE TABLE t_stuck_4m (id INT, pad TEXT);

INSERT INTO t_stuck_4m
SELECT a, repeat(md5(a::text), 5)
FROM generate_series(1, 1000000000) s(a)
WHERE hashint4(a)::bit(32) & x'FFFFE000'::bit(32) = 0::bit(32);

INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m; -- 2x
INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m; -- 4x
INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m; -- 8x
INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m; -- 16x

SET work_mem = '4MB';
SET enable_hashjoin_rehash = off;

EXPLAIN (ANALYZE, COSTS off, TIMING off, BUFFERS off) SELECT
* FROM t_stuck_4m a JOIN t_stuck_4m b USING (id) LIMIT 1;
QUERY PLAN

--------------------------------------------------------------------------------------------------
Limit (actual rows=1.00 loops=1)
-> Hash Join (actual rows=1.00 loops=1)
Hash Cond: (a.id = b.id)
-> Seq Scan on t_stuck_4m a (actual rows=1.00 loops=1)
-> Hash (actual rows=29248.00 loops=1)
Buckets: 32768 (originally 32768) Batches: 4 (originally
2) Memory Usage: 5974kB
-> Seq Scan on t_stuck_4m b (actual rows=29248.00 loops=1)
Planning Time: 0.243 ms
Execution Time: 10.288 ms

SET enable_hashjoin_rehash = on;
EXPLAIN (ANALYZE, COSTS off, TIMING off, BUFFERS off) SELECT
* FROM t_stuck_4m a JOIN t_stuck_4m b USING (id) LIMIT 1;
QUERY PLAN

--------------------------------------------------------------------------------------------------
Limit (actual rows=1.00 loops=1)
-> Hash Join (actual rows=1.00 loops=1)
Hash Cond: (a.id = b.id)
-> Seq Scan on t_stuck_4m a (actual rows=4.00 loops=1)
-> Hash (actual rows=29248.00 loops=1)
Buckets: 32768 (originally 32768) Batches: 4 (originally
2) Memory Usage: 3083kB
Rehash Repartitions: 1 (max sub-batches: 2)
-> Seq Scan on t_stuck_4m b (actual rows=29248.00 loops=1)
Planning Time: 0.110 ms
Execution Time: 10.071 ms

So the rehash drops memory usage from 5974 kB -> 3083 kB, under the 4MB
limit.

My observations:

1) The runtime when rehashing is used is nearly the same.

2) This execution split the one unsplittable batch into 2, reducing
total memory used.

I have more data and the patch to share if people are interested.

Best regards,
Ben Mejia

[1]
https://www.postgresql.org/message-id/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f@vondra.me

Browse pgsql-hackers by date

  From Date Subject
Next Message Scott Ray 2026-05-31 21:11:07 Re: [PATCH] Don't call ereport(ERROR) from recovery target GUC assign hooks
Previous Message Lukas Fittl 2026-05-31 19:01:40 Unify parallel worker handling for index builds and instrumentation