Re: DSA overflow in hash join

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, 4abe0c31-097e-460e-a159-c8042b63160b(at)garret(dot)ru
Subject: Re: DSA overflow in hash join
Date: 2025-08-14 19:34:24
Message-ID: c53dd06f-3e3b-4b7b-b3b7-65170019850e@iki.fi
Lists: pgsql-hackers

On 31/07/2025 18:13, Konstantin Knizhnik wrote:
>
> On 27/07/2025 8:24 PM, Konstantin Knizhnik wrote:
>>
>> I'm still trying to understand the reason for the DSA overflow in hash
>> join. In addition to the two suspicious places where the number of
>> buckets is doubled without a check for overflow (nodeHash.c:1668 and
>> nodeHash.c:3290), there is one more place where the number of batches
>> is multiplied by
>> `EstimateParallelHashJoinBatch(hashtable)` which is
>>
>> sizeof(ParallelHashJoinBatch) + (sizeof(SharedTuplestore)  +
>> sizeof(SharedTuplestoreParticipant) * participants) * 2
>>
>> which is 480 bytes!
>>
>> But when we calculate the maximal number of batches, we limit it by the
>> maximal number of pointers (8 bytes each):
>>
>>     max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
>>     max_pointers = Min(max_pointers, MaxAllocSize /
>> sizeof(HashJoinTuple));
>>     /* If max_pointers isn't a power of 2, must round it down to one */
>>     max_pointers = pg_prevpower2_size_t(max_pointers);
>>
>>     /* Also ensure we avoid integer overflow in nbatch and nbuckets */
>>     /* (this step is redundant given the current value of MaxAllocSize) */
>>     max_pointers = Min(max_pointers, INT_MAX / 2 + 1);
>>
>>     dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
>>     dbuckets = Min(dbuckets, max_pointers);
>>     nbuckets = (int) dbuckets;
>>
>>
>> But as we can see, the multiplier here is 480 bytes, not 8 bytes.
>>
>
> Below is script to reproduce the problem:
>
> CREATE TABLE IF NOT EXISTS t0(c0 FLOAT, PRIMARY KEY(c0)) WITH
> (parallel_workers=966);
> CREATE TABLE t2(c0 DECIMAL, c1 int4range ) WITH (parallel_workers=393);
> CREATE TABLE t4(LIKE t2);
> CREATE TABLE t5(LIKE t0);
> INSERT INTO t4(c0) VALUES(0.5934077416223362);
>
> set work_mem='10MB';
> set max_parallel_workers_per_gather=5;
>
> explain SELECT SUM(count) FROM (SELECT ALL CAST(FALSE AS INT) as count
> FROM ONLY t5, ONLY t2 CROSS JOIN ONLY t0 LEFT OUTER JOIN t4* ON
> (upper(((t2.c1)+(t2.c1))))::BOOLEAN CROSS JOIN (SELECT t4.c0 FROM ONLY
> t0, t2*, t5*, t4* WHERE (((t2.c1)*(t2.c1))) IN (t4.c1)) AS sub0) as res;
>
> SELECT SUM(count) FROM (SELECT ALL CAST(FALSE AS INT) as count FROM ONLY
> t5, ONLY t2 CROSS JOIN ONLY t0 LEFT OUTER JOIN t4* ON
> (upper(((t2.c1)+(t2.c1))))::BOOLEAN CROSS JOIN (SELECT t4.c0 FROM ONLY
> t0, t2*, t5*, t4* WHERE (((t2.c1)*(t2.c1))) IN (t4.c1)) AS sub0) as res;

Great repro, thanks!

> And attached please find patch fixing the issue.

There are a lot of allocations in hash join. For example, this one in
ExecParallelHashEnsureBatchAccessors():

	/* Allocate this backend's accessor array. */
	hashtable->nbatch = pstate->nbatch;
	hashtable->batches =
		palloc0_array(ParallelHashJoinBatchAccessor, hashtable->nbatch);

That could also exceed MaxAllocSize, right? I think we need to take that
into account in calculating the max number of batches, too.
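
Something along these lines is what I have in mind (rough, untested
sketch, building on the max_batches variable the patch introduces):

	/*
	 * Rough sketch: also cap nbatch so that the per-backend accessor
	 * array stays under MaxAllocSize.
	 */
	max_batches = Min(max_batches,
					  MaxAllocSize / sizeof(ParallelHashJoinBatchAccessor));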

Is this only a problem for parallel hash joins?

In general, it's really hard to follow the logic of where we enforce
which limits in ExecChooseHashTableSize(). How can we be sure that we've
accounted for all the allocations? Could we have more (static) assertions
or something, to ensure we've covered everything?
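
For example, a check next to the allocation itself, mirroring what the
sizing code is supposed to have guaranteed (untested sketch, placement
hypothetical):

	/* In ExecParallelHashJoinSetUpBatches(), before the dsa_allocate0() call */
	Assert((size_t) EstimateParallelHashJoinBatch(hashtable) * nbatch <=
		   MaxAllocSize);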

A different approach might be to make ExecChooseHashTableSize() return
just estimates, and enforce the hard limits later. But that could lead
to worse decisions: it's good for ExecChooseHashTableSize() to take the
maximum limits into account.
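
Something like this at the end of the function, I mean (hypothetical,
names made up); but as said, keeping the limits in the estimation itself
is probably better:

	/* post-hoc clamping; the hard caps would come only from MaxAllocSize */
	nbuckets = Min(nbuckets, hard_max_nbuckets);
	nbatch = Min(nbatch, hard_max_nbatch);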

On the proposed patch:

> @@ -775,6 +777,16 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
> /* If max_pointers isn't a power of 2, must round it down to one */
> max_pointers = pg_prevpower2_size_t(max_pointers);
>
> + /*
> + * Prevent DSA overflow. This is expanded definition of EstimateParallelHashJoinBatch function
> + * used in ExecParallelHashJoinSetUpBatches:
> + * dsa_allocate0(hashtable->area,
> + * EstimateParallelHashJoinBatch(hashtable) * nbatch)
> + */
> + max_batches = MaxAllocSize / (MAXALIGN(sizeof(ParallelHashJoinBatch)) +
> + MAXALIGN(sts_estimate(max_parallel_workers_per_gather + 1) * 2));
> + max_batches = pg_prevpower2_size_t(max_batches);
> +
> /* Also ensure we avoid integer overflow in nbatch and nbuckets */
> /* (this step is redundant given the current value of MaxAllocSize) */
> max_pointers = Min(max_pointers, INT_MAX / 2 + 1);

It would be nice not to copy the macro here, though.
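
Perhaps the formula could live in one place, for example as a variant of
the macro that takes the participant count directly (hypothetical sketch,
following the expansion given in the patch's comment):

	/* shared definition, so the planner-side limit and
	 * ExecParallelHashJoinSetUpBatches() can't drift apart */
	#define ParallelHashJoinBatchSize(nparticipants) \
		(MAXALIGN(sizeof(ParallelHashJoinBatch)) + \
		 MAXALIGN(sts_estimate(nparticipants) * 2))

	/* ... and then in ExecChooseHashTableSize(): */
	max_batches = MaxAllocSize /
		ParallelHashJoinBatchSize(max_parallel_workers_per_gather + 1);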

BTW, I don't like the name EstimateParallelHashJoinBatch(). It's not just
an estimate: it's used to allocate memory for an array, and it had better
be accurate. See also NthParallelHashJoinBatch.

> @@ -910,7 +923,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
> * this during the initial sizing - once we start building the hash,
> * nbucket is fixed.
> */
> - while (nbatch > 0)
> + while (nbatch > 0 &&
> + nbuckets * 2 <= max_pointers) /* prevent allocation limit overflow */

Hmm, that doesn't seem quite right. 'max_pointers' is derived from
work_mem, but the point of this loop is to reduce memory usage, when
you're already over the work_mem limit. I think we should use a hard
limit derived only from MaxAllocSize here.
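
I.e. something like this for the loop condition, instead of max_pointers
(sketch only):

	/*
	 * Hard cap on nbuckets derived only from the allocation limit: the
	 * bucket array is nbuckets * sizeof(HashJoinTuple) bytes.
	 */
	size_t	hard_max_buckets =
		pg_prevpower2_size_t(MaxAllocSize / sizeof(HashJoinTuple));

	while (nbatch > 0 && nbuckets * 2 <= hard_max_buckets)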

> {
> /* how much memory are we using with current nbatch value */
> size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);

Hmm, should we add the memory usage calculated by
EstimateParallelHashJoinBatch() to this per-batch overhead, too?
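
Something like this, I mean (sketch, reusing the expansion from the
patch's comment; the 'parallel' condition is hypothetical):

	/* include the shared per-batch state in the per-batch overhead */
	size_t	per_batch_overhead = 2 * BLCKSZ;

	if (parallel)
		per_batch_overhead += MAXALIGN(sizeof(ParallelHashJoinBatch)) +
			MAXALIGN(sts_estimate(max_parallel_workers_per_gather + 1) * 2);

	size_t	current_space = hash_table_bytes + nbatch * per_batch_overhead;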

- Heikki
