| From: | Ben Mejia <benjamin(dot)arthur(dot)mejia(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Cc: | tomas(dot)vondra(at)enterprisedb(dot)com, melanieplageman(at)gmail(dot)com |
| Subject: | [RFC] Secondary hash to split stuck hash-join batches |
| Date: | 2026-05-31 19:41:13 |
| Message-ID: | CAALR2u8WUfX0d+xJuf_12sgDZ29zTf_pWtTb2ACZL9en+_4q_g@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi,
Tomas clearly describes the underlying problem in [1]: stuck batches cannot
be split by batch doubling. This occurs when the tuples' hash values all
map to the same batch, similar to all items ending up in a single hash
bucket.
The HashJoin executor currently faces two unappealing fallback options:
a) Continually doubling nBatch in a futile attempt to split the
unsplittable batch.
b) Inflating spaceAllowed, which silently exceeds work_mem.
While PG18 caps the first fallback to prevent runaway batch explosions, it
attempts to manage the situation by incrementally increasing memory usage.
Consequently, work_mem is routinely violated by HashJoin, particularly when
encountering these unsplittable batches.
I have been experimenting with a rehash-based approach to reduce the memory
used by hash joins and would appreciate feedback on the direction before
polishing it for submission.
Sketch of Rehash Approach
- When ExecHashIncreaseNumBatches detects a stuck batch, call the new
function, ExecHashRehashBatch to attempt a rehash.
- Rehash attempts to reassign each tuple in the stuck batch to a sub-batch
using a secondary hash function.
- Sub-batches are handled in the same way as normal batches, and by the
same code.
- Sub-batches are split, like normal batches until the sub-batch fits into
spaceAllowed
- If the rehash cannot split the batch, it fails and the current
inflate-and-gut-it-out path is taken.
In my prototype, I use a GUC, enable_hashjoin_rehash to turn on the rehash
logic; otherwise the code works the same as before.
Additional Heuristics
Having the rehash available as a tool to control memory usage, I also
explored further modifications:
- Using planner estimates to detect skew vs. large tuples:
This can avoid doing a batch split and immediately attempt a rehash.
- Avoiding PG18's memory inflation:
Based on skew detection above, avoids inflating spaceAllowed and
attempts a rehash.
- Diminishing returns detection:
If batch splitting does less than a split-threshold attempt a rehash.
Some Preliminary Results
For the table:
CREATE TABLE t_stuck_4m (id INT, pad TEXT);
INSERT INTO t_stuck_4m
SELECT a, repeat(md5(a::text), 5)
FROM generate_series(1, 1000000000) s(a)
WHERE hashint4(a)::bit(32) & x'FFFFE000'::bit(32) = 0::bit(32);
INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m; -- 2x
INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m; -- 4x
INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m; -- 8x
INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m; -- 16x
SET work_mem = '4MB';
SET enable_hashjoin_rehash = off;
EXPLAIN (ANALYZE, COSTS off, TIMING off, BUFFERS off) SELECT
* FROM t_stuck_4m a JOIN t_stuck_4m b USING (id) LIMIT 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Limit (actual rows=1.00 loops=1)
-> Hash Join (actual rows=1.00 loops=1)
Hash Cond: (a.id = b.id)
-> Seq Scan on t_stuck_4m a (actual rows=1.00 loops=1)
-> Hash (actual rows=29248.00 loops=1)
Buckets: 32768 (originally 32768) Batches: 4 (originally
2) Memory Usage: 5974kB
-> Seq Scan on t_stuck_4m b (actual rows=29248.00 loops=1)
Planning Time: 0.243 ms
Execution Time: 10.288 ms
SET enable_hashjoin_rehash = on;
EXPLAIN (ANALYZE, COSTS off, TIMING off, BUFFERS off) SELECT
* FROM t_stuck_4m a JOIN t_stuck_4m b USING (id) LIMIT 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Limit (actual rows=1.00 loops=1)
-> Hash Join (actual rows=1.00 loops=1)
Hash Cond: (a.id = b.id)
-> Seq Scan on t_stuck_4m a (actual rows=4.00 loops=1)
-> Hash (actual rows=29248.00 loops=1)
Buckets: 32768 (originally 32768) Batches: 4 (originally
2) Memory Usage: 3083kB
Rehash Repartitions: 1 (max sub-batches: 2)
-> Seq Scan on t_stuck_4m b (actual rows=29248.00 loops=1)
Planning Time: 0.110 ms
Execution Time: 10.071 ms
So the rehash drops memory usage from 5974 kB -> 3083 kB, under the 4MB
limit.
My observations:
1) The runtime when rehashing is used is nearly the same.
2) This execution split the one unsplittable batch into 2, reducing
total memory used.
I have more data and the patch to share if people are interested.
Best regards,
Ben Mejia
[1]
https://www.postgresql.org/message-id/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f@vondra.me
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Scott Ray | 2026-05-31 21:11:07 | Re: [PATCH] Don't call ereport(ERROR) from recovery target GUC assign hooks |
| Previous Message | Lukas Fittl | 2026-05-31 19:01:40 | Unify parallel worker handling for index builds and instrumentation |