From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, Vaibhav Jain <jainva(at)google(dot)com>, pgsql-hackers(at)postgresql(dot)org, Madhukar <madhukarprasad(at)google(dot)com>, Sangeetha Seshadri <sangsesh(at)google(dot)com>
Subject: Re: Fix overflow of nbatch
Date: 2025-09-24 23:02:28
Message-ID: 244dc6c1-3b3d-4de2-b3de-b1511e6a6d10@vondra.me
Lists: pgsql-hackers
Hi,
Here are a couple of draft patches fixing the bug:
- 0001 adds the missing size_t cast, to fix the overflow
- 0002 fixes the balancing, by adjusting the hash table size limit
- 0003 adds the missing overflow protection for nbuckets and the hash
table limit
- 0004 rewords the comment explaining how the balancing works. Reading
it after a couple of months, I found it overly verbose / not very clear.
I'm sure it could be improved even further.
0001 and 0002 are pretty trivial. 0003 is a bit bigger, but most of the
code simply mirrors how we clamp nbuckets elsewhere (more or less).
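For reference, this is roughly the shape of the balancing loop before
these fixes, as I'd paraphrase it (simplified and wrapped in a dummy
function so it's self-contained - it's not the exact source), with the
two problems marked:

#include <stddef.h>

#define BLCKSZ 8192    /* default block size */

static void
balance_nbatch(size_t hash_table_bytes, size_t *space_allowed,
               int *nbatch, int *nbuckets)
{
    while (*nbatch > 1)
    {
        /*
         * Memory used now: hash table plus two BufFile buffers per
         * batch. "2 * *nbatch * BLCKSZ" is evaluated in int arithmetic,
         * so it overflows for large nbatch - this is the missing
         * (size_t) cast added by 0001.
         */
        size_t current_space = hash_table_bytes + (2 * *nbatch * BLCKSZ);

        /* memory if we halved nbatch and doubled the hash table */
        size_t new_space = hash_table_bytes * 2 + (*nbatch * BLCKSZ);

        /* stop once halving the batches no longer reduces the total */
        if (current_space < new_space)
            break;

        *nbatch /= 2;
        *nbuckets *= 2;
        *space_allowed *= 2;

        /*
         * Note that hash_table_bytes itself is never updated, so the
         * next iteration still compares against the initial limit -
         * that's what 0002 adjusts.
         */
    }
}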
At some point I started wondering whether it'd have been better to use
the algebraic solution posted by James Hunter back in February [1]. That
wouldn't need the loop, but it'd still need all this new overflow
protection etc.
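For reference, treating nbatch as continuous with M the total size of
the batched data, I'd expect the algebraic solution to look something
like this (my reconstruction - I haven't verified it's exactly what
James proposed):

    f(n) = \frac{M}{n} + 2n \cdot \mathrm{BLCKSZ}, \quad
    f'(n) = -\frac{M}{n^2} + 2 \cdot \mathrm{BLCKSZ} = 0
    \;\Rightarrow\; n^{*} = \sqrt{M / (2 \cdot \mathrm{BLCKSZ})}

rounded to the next power of two and clamped, which is why the overflow
protection would still be needed.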
I wanted to make sure the patches actually make it work correctly, so I
created a table with 4B rows:
create table t (a bigint, b text);
insert into t select i, md5(i::text)
from generate_series(1,4000000000) s(i);
and I added this log message at the end of ExecChooseHashTableSize:
elog(WARNING, "wm %d nbatch %d nbucket %d hash %ld total %ld",
     work_mem, nbatch, nbuckets, (*space_allowed)/1024,
     (*space_allowed + 2 * nbatch * (Size) BLCKSZ)/1024);
and I ran an explain on a self-join:
set enable_mergejoin = off;
set max_parallel_workers_per_gather = 0;
set work_mem = '...';
explain select * from t t1 join t t2 on (t1.a = t2.a);
with work_mem set to values between 64kB and 1GB.
On 18.0 I got this:
wm 64 nbatch 8 nbucket 2097152 hash 131072 total 131200
wm 128 nbatch 16 nbucket 4194304 hash 262144 total 262400
wm 256 nbatch 32 nbucket 8388608 hash 524288 total 524800
wm 512 nbatch 64 nbucket 16777216 hash 1048576 total 1049600
wm 1024 nbatch 128 nbucket 33554432 hash 2097152 total 2099200
wm 2048 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
wm 4096 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 8192 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 16384 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
I wanted to know how serious the issue is compared to what would happen
without the balancing, so I disabled the balancing (by skipping the
loop), and got this:
wm 64 nbatch 8192 nbucket 2048 hash 128 total 131200
wm 128 nbatch 16384 nbucket 4096 hash 256 total 262400
wm 256 nbatch 32768 nbucket 8192 hash 512 total 524800
wm 512 nbatch 65536 nbucket 16384 hash 1024 total 1049600
wm 1024 nbatch 131072 nbucket 32768 hash 2048 total 2099200
wm 2048 nbatch 131072 nbucket 65536 hash 4096 total 2101248
wm 4096 nbatch 65536 nbucket 131072 hash 8192 total 1056768
wm 8192 nbatch 32768 nbucket 262144 hash 16384 total 540672
wm 16384 nbatch 16384 nbucket 524288 hash 32768 total 294912
wm 32768 nbatch 8192 nbucket 1048576 hash 65536 total 196608
wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
The interesting bit is that the expected total memory usage (the last
number in the log line) is exactly the same with and without the
balancing. IIUC this is because the "stop" condition keeps using the
initial hash table size. It makes me a bit less worried about this
triggering OOM crashes - the balancing does not improve the behavior,
but it doesn't use more memory than before. Still an embarrassing bug,
though.
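You can see it directly in the wm=64 line: with the balancing the total
is 131072 kB (hash) + 2 * 8 * 8 kB (per-batch buffers) = 131200 kB,
without it it's 128 kB + 2 * 8192 * 8 kB = 131200 kB. The two terms
simply swap, the sum stays the same.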
With the attached patches, it looks like this:
wm 64 nbatch 256 nbucket 65536 hash 4096 total 8192
wm 128 nbatch 512 nbucket 131072 hash 8192 total 16384
wm 256 nbatch 1024 nbucket 262144 hash 16384 total 32768
wm 512 nbatch 2048 nbucket 524288 hash 32768 total 65536
wm 1024 nbatch 4096 nbucket 1048576 hash 65536 total 131072
wm 2048 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 4096 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 8192 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 16384 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
So, this time it actually seems to work correctly and significantly
reduces the memory usage ...
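FWIW the totals also match the optimum the loop is aiming for, with the
batch buffers costing roughly as much as the hash table - e.g. for
wm=64 that's 4096 kB (hash) + 2 * 256 * 8 kB (buffers) = 8192 kB.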
There's one weird thing remaining - if you look at nbatch, it actually
increases for the first couple of work_mem steps. That's weird, because
with a larger work_mem we should need *fewer* batches. But this has
nothing to do with the balancing - it happens even with the balancing
disabled.
The reason is that when calculating nbatch we do this:
dbatch = Min(dbatch, max_pointers);
and max_pointers is calculated from work_mem (among other things). It's
a bit funny that the logic worries about how many batch pointers we can
afford, and refuses to allow more, while at the same time ignoring the
BufFiles.
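To double-check that, here's a tiny standalone approximation of the
initial nbatch calculation (my own sketch, not PostgreSQL code - it
assumes 8-byte batch pointers and a ~256GB inner relation, to roughly
match the join above):

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double inner_rel_bytes = 256e9;    /* rough size of the inner side */

    for (long wm_kb = 64; wm_kb <= 2048; wm_kb *= 2)
    {
        double hash_table_bytes = wm_kb * 1024.0;

        /* cap on the batch pointer array: one 8-byte pointer each */
        double max_pointers = hash_table_bytes / 8;

        /* batches needed for each batch to fit into the hash table */
        double dbatch = ceil(inner_rel_bytes / hash_table_bytes);

        dbatch = fmin(dbatch, max_pointers);    /* the clamp in question */
        dbatch = pow(2, ceil(log2(dbatch)));    /* round up to power of 2 */

        printf("wm %ld nbatch %.0f\n", wm_kb, dbatch);
    }
    return 0;
}

This reproduces the nbatch column of the run with the balancing disabled
(8192, 16384, ... for the first steps), and makes it obvious the clamp
grows linearly with work_mem until the "real" batch count takes over.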
AFAICS it's harmless - we may pick a low number of batches initially,
but then later we'll ramp it up (and the balancing will work too). And
if you choose to run huge hash joins with a tiny work_mem, I guess
you're in for some suffering anyway. In any case, it's unrelated to the
balancing.
regards
--
Tomas Vondra
Attachments:
  vfix-0004-reword-balancing-comment.patch (text/x-patch, 5.1 KB)
  vfix-0003-nbuckets-overflow-protection.patch (text/x-patch, 2.5 KB)
  vfix-0002-adjust-hash_table_bytes-for-balancing.patch (text/x-patch, 695 bytes)
  vfix-0001-fix-size-overflow.patch (text/x-patch, 1.5 KB)