Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: jtc331(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date: 2019-11-08 23:30:44
Message-ID: 20191108233044.awe5owizhowebyy5@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
>The following bug has been logged on the website:
>
>Bug reference: 16104
>Logged by: James Coleman
>Email address: jtc331(at)gmail(dot)com
>PostgreSQL version: 11.5
>Operating system: Debian
>Description:
>
>We have a query that, after a recent logical migration to 11.5, ends up with
>a parallel hash join (I don't think the query plan/query itself is important
>here, but if needed after the rest of the explanation, I can try to redact
>it for posting). The query results in this error:
>
>ERROR: invalid DSA memory alloc request size 1375731712
>
>(the size changes sometimes significantly, but always over a GB)
>
>At first glance it sounded eerily similar to this report which preceded the
>final release of 11.0:
>https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde
>but I confirmed that the patch for that bug was applied and is in 11.5 (and
>earlier).
>
>We managed to reproduce this on a replica, and so were able to attach gdb in
>production to capture a backtrace:
>
>#0 errfinish (dummy=dummy(at)entry=0) at
>./build/../src/backend/utils/error/elog.c:423
>#1 0x000055a7c0a00f79 in elog_finish (elevel=elevel(at)entry=20,
>fmt=fmt(at)entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at
>./build/../src/backend/utils/error/elog.c:1385
>#2 0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38,
>size=1140850688, flags=flags(at)entry=4) at
>./build/../src/backend/utils/mmgr/dsa.c:677
>#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
>(hashtable=hashtable(at)entry=0x55a7c1db2740, nbatch=nbatch(at)entry=2097152) at
>./build/../src/backend/executor/nodeHash.c:2889
> ...

I've briefly looked at this today, and I think the root cause is
somewhat similar to what is described in [1] where we simply increase
the number of batches in an effort to keep batch contents in work_mem,
but ignoring that each batch requires quite a bit of memory. So we end
up with a lot of batches where each is small enough to fit into
work_mem, but we need much more than work_mem to track the batches.

This seems to be about the same problem, except that instead of
forgeting about BufFile, the parallel hash join ignores this:

pstate->batches =
dsa_allocate0(hashtable->area,
EstimateParallelHashJoinBatch(hashtable) * nbatch);

Looking at the backtrace, you ended up with nbatch=2097152 and it failed
trying to allocate 1140850688 bytes, i.e. exactly 544B per batch.

The EstimateParallelHashJoinBatch combines a bunch of structs and does
depend on the number of participants, and by my rough calculation 544B
means ~13 participants. Any idea if that matches the number of parallel
workers for this query?

Can you try running the query with fewer workers per gather?

IMHO this has the same issue as the non-parallel case described in [1]
in that we don't really consider this memory when planning the query,
which means we may accept hash join for queries where we know the number
of batches will be too high during planning. We need to reject parallel
hash join in those cases.

Of course, I don't know if this is such case - perhaps this is a case of
underestimation and we only learn the actual number while executing the
query, at which point it's too late to reject PHJ. I suppose the right
solution in that case would be some form of BNL, as discussed in [2] but
I'm not sure how far are we from having that, or if that works for PHJ.

regards

[1] https://www.postgresql.org/message-id/flat/20190504003414(dot)bulcbnge3rhwhcsh(at)development

[2] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message James Coleman 2019-11-09 00:23:25 Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Previous Message PG Bug reporting form 2019-11-08 21:52:16 BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash