BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: jtc331(at)gmail(dot)com
Subject: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date: 2019-11-08 21:52:16
Message-ID: 16104-dc11ed911f1ab9df@postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16104
Logged by: James Coleman
Email address: jtc331(at)gmail(dot)com
PostgreSQL version: 11.5
Operating system: Debian
Description:

We have a query that, after a recent logical migration to 11.5, ends up with
a parallel hash join (I don't think the query plan/query itself is important
here, but if needed after the rest of the explanation, I can try to redact
it for posting). The query results in this error:

ERROR: invalid DSA memory alloc request size 1375731712

(the size varies, sometimes significantly, but is always over 1 GB)

At first glance it sounded eerily similar to this report which preceded the
final release of 11.0:
https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde
but I confirmed that the patch for that bug was applied and is in 11.5 (and
earlier).

We managed to reproduce this on a replica, and so were able to attach gdb in
production to capture a backtrace:

#0 errfinish (dummy=dummy(at)entry=0) at
./build/../src/backend/utils/error/elog.c:423
#1 0x000055a7c0a00f79 in elog_finish (elevel=elevel(at)entry=20,
fmt=fmt(at)entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at
./build/../src/backend/utils/error/elog.c:1385
#2 0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38,
size=1140850688, flags=flags(at)entry=4) at
./build/../src/backend/utils/mmgr/dsa.c:677
#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
(hashtable=hashtable(at)entry=0x55a7c1db2740, nbatch=nbatch(at)entry=2097152) at
./build/../src/backend/executor/nodeHash.c:2889
#4 0x000055a7c079e5f9 in ExecParallelHashIncreaseNumBatches
(hashtable=0x55a7c1db2740) at
./build/../src/backend/executor/nodeHash.c:1122
#5 0x000055a7c079ef6e in ExecParallelHashTuplePrealloc (size=56,
batchno=<optimized out>, hashtable=0x55a7c1db2740) at
./build/../src/backend/executor/nodeHash.c:3283
#6 ExecParallelHashTableInsert (hashtable=hashtable(at)entry=0x55a7c1db2740,
slot=slot(at)entry=0x55a7c1dadc90, hashvalue=<optimized out>) at
./build/../src/backend/executor/nodeHash.c:1716
#7 0x000055a7c079f17f in MultiExecParallelHash (node=0x55a7c1dacb78) at
./build/../src/backend/executor/nodeHash.c:288
#8 MultiExecHash (node=node(at)entry=0x55a7c1dacb78) at
./build/../src/backend/executor/nodeHash.c:112
#9 0x000055a7c078c40c in MultiExecProcNode (node=node(at)entry=0x55a7c1dacb78)
at ./build/../src/backend/executor/execProcnode.c:501
#10 0x000055a7c07a07d5 in ExecHashJoinImpl (parallel=true, pstate=<optimized
out>) at ./build/../src/backend/executor/nodeHashjoin.c:290
#11 ExecParallelHashJoin (pstate=<optimized out>) at
./build/../src/backend/executor/nodeHashjoin.c:581
#12 0x000055a7c078bdd9 in ExecProcNodeInstr (node=0x55a7c1d7b018) at
./build/../src/backend/executor/execProcnode.c:461
#13 0x000055a7c079f142 in ExecProcNode (node=0x55a7c1d7b018) at
./build/../src/include/executor/executor.h:251
#14 MultiExecParallelHash (node=0x55a7c1d759d0) at
./build/../src/backend/executor/nodeHash.c:281
#15 MultiExecHash (node=node(at)entry=0x55a7c1d759d0) at
./build/../src/backend/executor/nodeHash.c:112
#16 0x000055a7c078c40c in MultiExecProcNode (node=node(at)entry=0x55a7c1d759d0)
at ./build/../src/backend/executor/execProcnode.c:501
#17 0x000055a7c07a07d5 in ExecHashJoinImpl (parallel=true, pstate=<optimized
out>) at ./build/../src/backend/executor/nodeHashjoin.c:290
#18 ExecParallelHashJoin (pstate=<optimized out>) at
./build/../src/backend/executor/nodeHashjoin.c:581
#19 0x000055a7c078bdd9 in ExecProcNodeInstr (node=0x55a7c1d74e60) at
./build/../src/backend/executor/execProcnode.c:461
#20 0x000055a7c0784303 in ExecProcNode (node=0x55a7c1d74e60) at
./build/../src/include/executor/executor.h:251
#21 ExecutePlan (execute_once=<optimized out>, dest=0x55a7c1d0be00,
direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>,
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x55a7c1d74e60, estate=0x55a7c1d74b70) at
./build/../src/backend/executor/execMain.c:1640
#22 standard_ExecutorRun (queryDesc=0x55a7c1d5dcd0, direction=<optimized
out>, count=0, execute_once=<optimized out>) at
./build/../src/backend/executor/execMain.c:369
#23 0x00007f4b8b9ace85 in pgss_ExecutorRun (queryDesc=0x55a7c1d5dcd0,
direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at
./build/../contrib/pg_stat_statements/pg_stat_statements.c:892
#24 0x000055a7c07893d1 in ParallelQueryMain (seg=0x55a7c1caa648,
toc=<optimized out>) at
./build/../src/backend/executor/execParallel.c:1401
#25 0x000055a7c064ee64 in ParallelWorkerMain (main_arg=<optimized out>) at
./build/../src/backend/access/transam/parallel.c:1409
#26 0x000055a7c08568ed in StartBackgroundWorker () at
./build/../src/backend/postmaster/bgworker.c:834
#27 0x000055a7c08637b5 in do_start_bgworker (rw=0x55a7c1c98200) at
./build/../src/backend/postmaster/postmaster.c:5722
#28 maybe_start_bgworkers () at
./build/../src/backend/postmaster/postmaster.c:5935
#29 0x000055a7c0864355 in sigusr1_handler (postgres_signal_arg=<optimized
out>) at ./build/../src/backend/postmaster/postmaster.c:5096
#30 <signal handler called>
#31 0x00007f4b915895e3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#32 0x000055a7c05d8b5d in ServerLoop () at
./build/../src/backend/postmaster/postmaster.c:1671
#33 0x000055a7c08654f1 in PostmasterMain (argc=5, argv=0x55a7c1c73e50) at
./build/../src/backend/postmaster/postmaster.c:1380
#34 0x000055a7c05dac34 in main (argc=5, argv=0x55a7c1c73e50) at
./build/../src/backend/main/main.c:228

From what I can tell, src/backend/executor/nodeHash.c:2888 (looking at the
11.5 release tag) is another entry point into the same class of problem that
the patch I mentioned earlier guarded against: another way parallel hash
nodes can end up attempting to allocate more memory than is allowed.

Thanks,
James Coleman
