From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, Vaibhav Jain <jainva(at)google(dot)com>, pgsql-hackers(at)postgresql(dot)org, Madhukar <madhukarprasad(at)google(dot)com>, Sangeetha Seshadri <sangsesh(at)google(dot)com>
Subject: Re: Fix overflow of nbatch
Date: 2025-10-07 23:51:50
Message-ID: 8754c898-1607-4db2-aaeb-7554665a3f6d@vondra.me
Lists: pgsql-hackers
On 10/7/25 23:10, Melanie Plageman wrote:
> On Wed, Sep 24, 2025 at 7:02 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> Here's a couple draft patches fixing the bug:
>>
>> - 0001 adds the missing size_t cast, to fix the overflow
>> - 0002 fixes the balancing, by adjusting the hash table size limit
>> - 0003 adds the missing overflow protection for nbuckets and the hash
>> table limit
>
> I've attached an alternative patch idea. It's shorter because I
> avoided multiplication using algebraic equivalence. There are two
> overflow checks for nbuckets and that max_pointers will be able to be
> allocated, but otherwise, it avoids most of the overflow risk by
> switching to division.
>
Thanks for looking!
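If I understand the idea, it boils down to this pattern (just a sketch
of the overflow-avoidance trick, not the actual patch; some_limit
stands in for whatever cap applies):

    /* risky: the multiplication itself can overflow an int */
    if (nbuckets * 2 > some_limit)
        break;

    /* same test rewritten with division, so nothing can overflow */
    if (nbuckets > some_limit / 2)
        break;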
> On the topic of max_pointers, I think there is no need to
> handle > MaxAllocSize. We were willing to cope with the original
> values of nbatch and nbuckets; we are just trying to optimize that.
> If doing so would make us run out of memory for the arrays of
> pointers to buckets, just use the last hashtable size that didn't
> have this problem. That means we don't have to do any shenanigans to
> have powers of two for nbuckets because we don't do clamping.
>
Hmm, so if I understand correctly, you suggest stopping when nbuckets
gets too high, while my code allowed reducing nbatch further (and just
capped nbuckets). I'm fine with this change if it makes the code
simpler. It means we allow ~130M buckets, which seems rather unlikely
to be a problem in practice: that's 1GB for the bucket array alone,
and the tuples tend to take a multiple of that. With 4GB total, that's
~256k batches, and multiplied by the 130M buckets that would be
~3.5e13 tuples ...
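Just to spell out the back-of-the-envelope arithmetic (assuming 8-byte
bucket pointers, the default 8kB BLCKSZ, and roughly one tuple per
bucket):

    ~130M buckets * 8 bytes       ~= 1GB for the bucket array
    4GB / (2 * 8kB per batch)     ~= 256k batches
    256k batches * ~130M buckets  ~= 3.5e13 tuples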
However, I don't understand why the condition bothers with INT_MAX:
    /* Ensure that nbuckets * 2 doesn't overflow an int */
    if (nbuckets > INT_MAX / 2)
        break;
AFAIK what matters is whether the buckets fit into MaxAllocSize, but
INT_MAX (or even INT_MAX/2) buckets would need way more memory than
that. The comment in the original code (copied from an earlier part of
the function) even points out the INT_MAX check is redundant, given
the current MaxAllocSize value.
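What I'd expect here is a cap derived from MaxAllocSize, something
like this (just a sketch):

    /* stop if doubling nbuckets would exceed MaxAllocSize for the bucket array */
    if ((size_t) nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
        break;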
It's however true that if we double nbuckets and nbatch at the same
time, we don't need to bother doing this:

    max_pointers = hash_table_bytes / sizeof(HashJoinTuple);

because that break could never trigger. Although, is that really true?
When picking the initial nbuckets value we do this:

    nbuckets = Max(nbuckets, 1024);

What if this increased nbuckets (it'd require an extremely low
work_mem, of course)? Could it lead to unexpectedly high nbuckets in
the loop?
Similarly, why does this check compare hash_table_bytes to a number of
buckets?

    if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
        break;

I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of
buckets (hj tuple pointers) we can fit into 1GB, but hash_table_bytes
is the size of the hash table in bytes, not a number of items. Also,
see how get_hash_memory_limit() caps the limit at SIZE_MAX. So I think
we should continue doing that.
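For reference, get_hash_memory_limit() does roughly this (quoting from
memory, so treat it as a sketch):

    /* compute work_mem * hash_mem_multiplier in double arithmetic,
     * then clamp the result so it fits into a size_t */
    mem_limit = (double) work_mem * hash_mem_multiplier * 1024.0;
    mem_limit = Min(mem_limit, (double) SIZE_MAX);
    return (size_t) mem_limit;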
I like the idea of simplifying the stop condition, instead of
calculating the old/new size, and comparing those. But is this actually
correct?
    if (nbatch / 4 < hash_table_bytes / BLCKSZ)
        break;

If we start with the "full" condition

    hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)

subtract hash_bytes from both sides

    (2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)

subtract (nbatch * BLCKSZ)

    nbatch * BLCKSZ < hash_bytes

and divide by BLCKSZ, which gives us

    nbatch < (hash_bytes / BLCKSZ)

So where did the /4 come from? Also, maybe it'd be easier to just do

    (nbatch * BLCKSZ) < hash_bytes

i.e. without the division.
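That is, something like this (a sketch, relying on the size_t cast
from 0001 so the multiplication can't overflow):

    /* stop once halving nbatch further would no longer reduce total memory */
    if ((size_t) nbatch * BLCKSZ < hash_table_bytes)
        break;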
> Also, I think the outer loop needs the condition nbatch > 1 not
> nbatch > 0 -- when nbatch is 1, once we divide it by 2, it would end
> up as 0.
>
Good point. I was wondering why we're not seeing any failures because
of this (e.g. from tests using tiny amounts of data, with a single
batch). But the loop stops immediately, because the two batch files
use much less memory than the minimum work_mem value (even without the
multiplier). So it's benign, but let's fix the condition anyway.
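In other words, just (a sketch):

    /* keep halving nbatch only while there's something left to halve */
    while (nbatch > 1)
    {
        ...
        nbatch /= 2;    /* with "nbatch > 0" this could drive nbatch to 0 */
    }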
> I'm still waiting for the 4 billion row table to be created on my
> machine, so I haven't verified that I get the same results as you yet.
>
>> - 0004 rewords the comment explaining how the balancing works. Reading
>> it after a couple months, I found it overly verbose / not very clear.
>> I'm sure it could be improved even further.
>
> My attached patch does build on your revised wording.
>
Thanks. I'll re-read the comment tomorrow.
regards
--
Tomas Vondra