From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, Vaibhav Jain <jainva(at)google(dot)com>, pgsql-hackers(at)postgresql(dot)org, Madhukar <madhukarprasad(at)google(dot)com>, Sangeetha Seshadri <sangsesh(at)google(dot)com>
Subject: Re: Fix overflow of nbatch
Date: 2025-10-08 17:37:35
Message-ID: CAAKRu_bm1FnOSsXTy8jpG6=+syEqxrgMSjAVEaSsBUSUdGaEkA@mail.gmail.com
Lists: pgsql-hackers
On Tue, Oct 7, 2025 at 7:51 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 10/7/25 23:10, Melanie Plageman wrote:
>
> Hmm, so if I understand correctly you suggest stop when nbuckets gets
> too high, while my code allowed reducing nbatches further (and just
> capped nbatches). I'm fine with this change, if it makes the code
> simpler, that means we allow ~130M buckets, which seems rather unlikely
> to be a problem in practice. That means 1GB for buckets alone, and
> tuples tend to be a multiple of that. With 4GB total, that's ~256k
> batches. And multiplied by the 130M that would be 3.5e13 tuples ...
>
> However, I don't understand why the condition bothers about INT_MAX?
>
> /* Ensure that nbuckets * 2 doesn't overflow an int */
> if (nbuckets > INT_MAX / 2)
> break;
You're right. This can just be
if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;
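(On a 64-bit build, MaxAllocSize / sizeof(HashJoinTuple) / 2 works out to roughly 1GB / 8 / 2, about 67M, so we stop doubling once the next doubling would push the bucket array past the ~134M pointers that fit in MaxAllocSize, which lines up with the ~130M figure you mention above.)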
> It's however true that if we double nbuckets and nbatch at the same
> time, we don't need to bother doing this:
>
> max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
>
> because we can never break. Although, is that really true? When picking
> the initial nbuckets value we do this:
>
> nbuckets = Max(nbuckets, 1024);
>
> What if this increased the nbuckets (it'd require extremely low work_mem
> of course)? Could it lead to unexpectedly high nbuckets in the loop?
If we are worried about nbuckets exceeding what can be allocated, then
I think the proposed condition above takes care of that:
if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
break;
As for whether bumping nbuckets to 1024 before the loop means we can
end up with very high values at low work_mem: it seems like even with
the minimum work_mem, the number of buckets is already larger than
1024, so that clamp shouldn't be what drives nbuckets up. When could we
hit this? Maybe if the number of skew buckets is very large?
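(Rough numbers, assuming a 64-bit build: with the 64kB minimum for
work_mem, hash_table_bytes / sizeof(HashJoinTuple) is about 65536 / 8 =
8192, well above the 1024 floor, so the Max() clamp shouldn't be what
inflates nbuckets.)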
> Similarly, why does this check care about number of buckets?
>
> if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
> break;
>
> I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
> (hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
> size of the hash table in bytes, not in number of items. Also, see how
> get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
> continue doing that.
Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:
hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize / sizeof(HashJoinTuple) / 2
but I don't think we need this check, because nbuckets should always be
at least as big as hash_table_bytes / sizeof(HashJoinTuple), since to
start out with it was clamped to Max(1024, hash_table_bytes /
sizeof(HashJoinTuple)).
The only exception would be if MaxAllocSize / sizeof(HashJoinTuple)
were smaller than hash_table_bytes / sizeof(HashJoinTuple), in which
case checking nbuckets > MaxAllocSize / sizeof(HashJoinTuple) should
cover it anyway.
> I like the idea of simplifying the stop condition, instead of
> calculating the old/new size, and comparing those. But is this actually
> correct?
>
> if (nbatch / 4 < hash_table_bytes / BLCKSZ)
> break;
>
> If we start with the "full" condition
>
> hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)
>
> subtract hash_bytes from both sides
>
> (2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)
>
> subtract (nbatch * BLCKSZ)
>
> nbatch * BLCKSZ < hash_bytes
>
> that gives us
>
> nbatch < (hash_bytes / BLCKSZ)
>
> So where did the /4 come from?
Yes, I made the mistake of doubling the batches and hash_table_bytes
again, forgetting that the original formula was already comparing the
hypothetical space to the current space.
I think it should be
if (nbatch < hash_table_bytes / BLCKSZ)
as you say
> Also, maybe it'd be easier to just do
> (nbatch * BLCKSZ) < hash_bytes
>
> i.e. without the division.
I prefer the division, to avoid as many potentially overflowing
operations as possible. Otherwise we would have to check that nbatch *
BLCKSZ doesn't overflow first.
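To make sure we're talking about the same shape, here's roughly how I
picture the balancing loop with those two stop conditions (variable
names as in ExecChooseHashTableSize(), space_allowed being the output
parameter; just a sketch to illustrate, not the exact patch):

    while (nbatch > 1)
    {
        /* stop once the batch buffers are no longer the dominant cost */
        if (nbatch < hash_table_bytes / BLCKSZ)
            break;

        /* don't let the doubled bucket array exceed MaxAllocSize */
        if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
            break;

        /* double the hash table (and buckets), halve the batches */
        hash_table_bytes *= 2;
        *space_allowed *= 2;
        nbuckets *= 2;
        nbatch /= 2;
    }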
> > I'm still waiting for the 4billion row table to be created on my
> > machine, so I haven't verified that I get the same results as you yet.
I have updated my patch to fix the mistakes above. I also noticed that
I wasn't doubling space_allowed in the loop but was instead setting it
to hash_table_bytes at the end. That doesn't produce a power of 2,
because we subtract the space for the skew hashtable (skew_mcvs) from
hash_table_bytes. So we have to keep doubling space_allowed in the loop
if we want a power of 2 at the end.
I've changed my patch to do that, but it made me wonder whether that's
what we want, or whether we should instead take hash_table_bytes at the
end, round it up to a power of 2, and set space_allowed to that. If the
skew hashtable is large, we may be allocating way more space_allowed
than we need for the new hash_table_bytes + skew hashtable buckets.
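For that alternative I was imagining something like the following at
the end of the function (sketch only: skew_table_bytes stands in for
whatever we reserved for the skew hashtable, and I'm assuming
pg_nextpower2_size_t from pg_bitutils.h is the right helper for the
round-up):

    /*
     * Instead of doubling *space_allowed in lockstep inside the loop, derive
     * it from what we actually need and round up to a power of 2, so a large
     * skew hashtable doesn't inflate the allowance.
     */
    *space_allowed = pg_nextpower2_size_t(hash_table_bytes + skew_table_bytes);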
Oh, and with the attached v2 I get the same logging output as your patch.
- Melanie
Attachment: v2-0001-Fix-overflow-risk-in-hashtable-size-calculation.patch (text/x-patch, 7.5 KB)