Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, James Coleman <jtc331(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date: 2019-12-16 21:02:49
Message-ID: 20191216210249.ht2udaevrgyyzbmr@development
Lists: pgsql-bugs

On Mon, Dec 16, 2019 at 03:09:22PM +0100, Tomas Vondra wrote:
>On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote:
>>On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote:
>>>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra
>>><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>>The annoying thing is that it's the workers that crash, and the leader
>>>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit
>>>>forever. Not sure what the issue is.
>>>
>>>That's bad. Some edge case in
>>>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is
>>>b0rked perhaps? Can you reproduce it, and tell us what the stack
>>>looks like, maybe on a new thread?
>>
>>Sure. I'll try reproducing it and get this extra info. I was planning to
>>do that, but then I lost the data for the report because of a silly
>>mistake. Early next week, hopefully.
>>
>
>I've tried reproducing this, but so far no luck. I've only managed to do
>two runs, though, because this particular machine only has space for the
>1TB of temp files on spinning rust. I wonder if it's possible to
>reproduce the issue with a smaller data set? I've tried using smaller
>work_mem, but then it does not trigger the issue :-(
>

And of course, the very next run after I sent the "can't reproduce"
message actually reproduced the issue. How typical ...

So, there were ~8 workers while the query was running, with the leader
process being 29417. At the moment, the list of processes looks like
this:

bench2 ~ # ps ax | grep postgres
11877 pts/2 S 0:00 su - postgres
14082 ? Ss 12:38 postgres: parallel worker for PID 29417
14083 ? Ss 12:28 postgres: parallel worker for PID 29417
17334 pts/0 S+ 0:00 grep --colour=auto postgres
29408 ? Ss 0:01 /var/lib/postgresql/pg-master/bin/postgres -D /mnt/data/data-master
29410 ? Ss 0:00 postgres: checkpointer
29411 ? Ss 0:01 postgres: background writer
29412 ? Ss 0:01 postgres: walwriter
29413 ? Ss 0:02 postgres: autovacuum launcher
29414 ? Ss 0:14 postgres: stats collector
29415 ? Ss 0:00 postgres: logical replication launcher
29417 ? Ss 75:11 postgres: postgres test [local] EXPLAIN

and it failed like this:

2019-12-16 20:25:00.617 CET [14081] ERROR: invalid DSA memory alloc request size 1711276032
2019-12-16 20:25:00.617 CET [14081] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.617 CET [29417] ERROR: invalid DSA memory alloc request size 1711276032
2019-12-16 20:25:00.617 CET [29417] CONTEXT: parallel worker
2019-12-16 20:25:00.617 CET [29417] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.618 CET [14080] FATAL: terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14080] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.618 CET [14085] FATAL: terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14085] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.618 CET [14084] FATAL: terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14084] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.618 CET [14079] FATAL: terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14079] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.618 CET [14078] FATAL: terminating connection due to administrator command
2019-12-16 20:25:00.618 CET [14078] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
2019-12-16 20:25:00.746 CET [29408] LOG: background worker "parallel worker" (PID 14081) exited with exit code 1
2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14079) exited with exit code 1
2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14085) exited with exit code 1
2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14084) exited with exit code 1
2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14080) exited with exit code 1
2019-12-16 20:25:00.818 CET [29408] LOG: background worker "parallel worker" (PID 14078) exited with exit code 1
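
FWIW the alloc failure itself looks like the plain size sanity check in
dsa_allocate_extended() (dsa.c), which rejects requests above MaxAllocSize
(~1GB) unless the caller passes DSA_ALLOC_HUGE. Paraphrasing from memory
(so not verbatim source), the check is roughly:

    /*
     * Rough paraphrase of the size sanity check in dsa_allocate_extended();
     * not verbatim source.
     */
    if (((flags & DSA_ALLOC_HUGE) != 0 && !AllocHugeSizeIsValid(size)) ||
        ((flags & DSA_ALLOC_HUGE) == 0 && !AllocSizeIsValid(size)))
        elog(ERROR, "invalid DSA memory alloc request size %zu", size);

so the ~1.6GB request without DSA_ALLOC_HUGE trips it. But that error is
the bug we already knew about - the hang afterwards is the new part.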

Those log entries are from ~2 hours ago. There's no activity at all (no
I/O or anything like that), and the temp directory still holds ~1TB of
allocated files:

bench2 /var/lib/postgresql # du -s /mnt/raid/pgsql_tmp/
1099760800 /mnt/raid/pgsql_tmp/

which means nothing got cleaned up. The backtraces for the remaining
backends look like this:

1) 29417 (i.e. the leader)

0x00007f64684d2bfb in epoll_wait (epfd=5, events=0x55d887416dc0, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);
(gdb) bt
#0 0x00007f64684d2bfb in epoll_wait (epfd=5, events=0x55d887416dc0, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d887416d48, cur_timeout=-1, occurred_events=0x7ffc6e805c70, nevents=1) at latch.c:1080
#2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d887416d48, timeout=-1, occurred_events=0x7ffc6e805c70, nevents=1, wait_event_info=134217728) at latch.c:1032
#3 0x000055d8864d4565 in WaitLatchOrSocket (latch=0x7f64670ac194, wakeEvents=17, sock=-1, timeout=-1, wait_event_info=134217728) at latch.c:407
#4 0x000055d8864d4421 in WaitLatch (latch=0x7f64670ac194, wakeEvents=17, timeout=0, wait_event_info=134217728) at latch.c:347
#5 0x000055d88644c18b in WaitForBackgroundWorkerShutdown (handle=0x55d88742abc8) at bgworker.c:1153
#6 0x000055d886147690 in WaitForParallelWorkersToExit (pcxt=0x55d88742a338) at parallel.c:842
#7 0x000055d8861478af in DestroyParallelContext (pcxt=0x55d88742a338) at parallel.c:924
#8 0x000055d886147feb in AtEOXact_Parallel (isCommit=false) at parallel.c:1197
#9 0x000055d886153774 in AbortTransaction () at xact.c:2654
#10 0x000055d886153f80 in AbortCurrentTransaction () at xact.c:3196
#11 0x000055d8865044e7 in PostgresMain (argc=1, argv=0x55d8873d41e8, dbname=0x55d8873d4118 "test", username=0x55d8873d40f8 "postgres") at postgres.c:4081
#12 0x000055d88645f0d6 in BackendRun (port=0x55d8873ca150) at postmaster.c:4498
#13 0x000055d88645e7fc in BackendStartup (port=0x55d8873ca150) at postmaster.c:4189
#14 0x000055d88645abd0 in ServerLoop () at postmaster.c:1727
#15 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400
#16 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210
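
So the leader is stuck in the WaitForBackgroundWorkerShutdown() loop.
Paraphrasing bgworker.c from memory (not verbatim source), that loop is
essentially:

    /* Rough shape of WaitForBackgroundWorkerShutdown(), paraphrased. */
    BgwHandleStatus status;
    int         rc;

    for (;;)
    {
        pid_t       pid;

        CHECK_FOR_INTERRUPTS();

        status = GetBackgroundWorkerPid(handle, &pid);
        if (status == BGWH_STOPPED)
            break;

        rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH,
                       0, WAIT_EVENT_BGWORKER_SHUTDOWN);

        if (rc & WL_POSTMASTER_DEATH)
        {
            status = BGWH_POSTMASTER_DIED;
            break;
        }

        ResetLatch(MyLatch);
    }

i.e. it won't return until the worker is reported as stopped, so the
leader is going to sit there until 14082/14083 actually exit - which
they never do, as the backtraces below show.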

2) 14082

0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);
(gdb) bt
#0 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805130, nevents=1) at latch.c:1080
#2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805130, nevents=1, wait_event_info=134217742) at latch.c:1032
#3 0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186
#4 0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126
#5 0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191
#6 0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189
#7 0x000055d88632ca1f in ExecParallelHashTuplePrealloc (hashtable=0x55d887481e70, batchno=237, size=40) at nodeHash.c:3308
#8 0x000055d886329dfe in ExecParallelHashTableInsert (hashtable=0x55d887481e70, slot=0x55d887470e10, hashvalue=994179508) at nodeHash.c:1725
#9 0x000055d8863275dc in MultiExecParallelHash (node=0x55d887470860) at nodeHash.c:288
#10 0x000055d8863271e3 in MultiExecHash (node=0x55d887470860) at nodeHash.c:111
#11 0x000055d88630b79c in MultiExecProcNode (node=0x55d887470860) at execProcnode.c:500
#12 0x000055d88632cfac in ExecHashJoinImpl (pstate=0x55d8874700a0, parallel=true) at nodeHashjoin.c:291
#13 0x000055d88632d617 in ExecParallelHashJoin (pstate=0x55d8874700a0) at nodeHashjoin.c:588
#14 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d8874700a0) at execProcnode.c:460
#15 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d8874700a0) at execProcnode.c:444
#16 0x000055d886318a14 in ExecProcNode (node=0x55d8874700a0) at ../../../src/include/executor/executor.h:240
#17 0x000055d886318e8b in fetch_input_tuple (aggstate=0x55d88746f628) at nodeAgg.c:405
#18 0x000055d88631b2fa in agg_retrieve_direct (aggstate=0x55d88746f628) at nodeAgg.c:1757
#19 0x000055d88631af4a in ExecAgg (pstate=0x55d88746f628) at nodeAgg.c:1572
#20 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d88746f628) at execProcnode.c:460
#21 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d88746f628) at execProcnode.c:444
#22 0x000055d886300882 in ExecProcNode (node=0x55d88746f628) at ../../../src/include/executor/executor.h:240
#23 0x000055d886303055 in ExecutePlan (estate=0x55d88746f398, planstate=0x55d88746f628, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55d88741d408, execute_once=true) at execMain.c:1646
#24 0x000055d886300d9a in standard_ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364
#25 0x000055d886300c3c in ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:308
#26 0x000055d886307c2b in ParallelQueryMain (seg=0x55d8873d23d8, toc=0x7f63602a0000) at execParallel.c:1401
#27 0x000055d886148632 in ParallelWorkerMain (main_arg=2143010421) at parallel.c:1430
#28 0x000055d88644ba56 in StartBackgroundWorker () at bgworker.c:834
#29 0x000055d88645ff82 in do_start_bgworker (rw=0x55d8873ca340) at postmaster.c:5839
#30 0x000055d88646038f in maybe_start_bgworkers () at postmaster.c:6065
#31 0x000055d88645f265 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5234
#32 <signal handler called>
#33 0x00007f64684c956b in __GI___select (nfds=7, readfds=0x7ffc6e8060e0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e806050) at ../sysdeps/unix/sysv/linux/select.c:41
#34 0x000055d88645aa3c in ServerLoop () at postmaster.c:1691
#35 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400
#36 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210

3) 14083

0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);
(gdb) bt
#0 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805120, nevents=1) at latch.c:1080
#2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805120, nevents=1, wait_event_info=134217742) at latch.c:1032
#3 0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186
#4 0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126
#5 0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191
#6 0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189
#7 0x000055d88632b93c in ExecParallelHashTupleAlloc (hashtable=0x55d887481e70, size=40, shared=0x7ffc6e8052b8) at nodeHash.c:2805
#8 0x000055d886329d46 in ExecParallelHashTableInsert (hashtable=0x55d887481e70, slot=0x55d887470e10, hashvalue=2006491) at nodeHash.c:1702
#9 0x000055d8863275dc in MultiExecParallelHash (node=0x55d887470860) at nodeHash.c:288
#10 0x000055d8863271e3 in MultiExecHash (node=0x55d887470860) at nodeHash.c:111
#11 0x000055d88630b79c in MultiExecProcNode (node=0x55d887470860) at execProcnode.c:500
#12 0x000055d88632cfac in ExecHashJoinImpl (pstate=0x55d8874700a0, parallel=true) at nodeHashjoin.c:291
#13 0x000055d88632d617 in ExecParallelHashJoin (pstate=0x55d8874700a0) at nodeHashjoin.c:588
#14 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d8874700a0) at execProcnode.c:460
#15 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d8874700a0) at execProcnode.c:444
#16 0x000055d886318a14 in ExecProcNode (node=0x55d8874700a0) at ../../../src/include/executor/executor.h:240
#17 0x000055d886318e8b in fetch_input_tuple (aggstate=0x55d88746f628) at nodeAgg.c:405
#18 0x000055d88631b2fa in agg_retrieve_direct (aggstate=0x55d88746f628) at nodeAgg.c:1757
#19 0x000055d88631af4a in ExecAgg (pstate=0x55d88746f628) at nodeAgg.c:1572
#20 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d88746f628) at execProcnode.c:460
#21 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d88746f628) at execProcnode.c:444
#22 0x000055d886300882 in ExecProcNode (node=0x55d88746f628) at ../../../src/include/executor/executor.h:240
#23 0x000055d886303055 in ExecutePlan (estate=0x55d88746f398, planstate=0x55d88746f628, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55d88741d408, execute_once=true) at execMain.c:1646
#24 0x000055d886300d9a in standard_ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364
#25 0x000055d886300c3c in ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:308
#26 0x000055d886307c2b in ParallelQueryMain (seg=0x55d8873d23d8, toc=0x7f63602a0000) at execParallel.c:1401
#27 0x000055d886148632 in ParallelWorkerMain (main_arg=2143010421) at parallel.c:1430
#28 0x000055d88644ba56 in StartBackgroundWorker () at bgworker.c:834
#29 0x000055d88645ff82 in do_start_bgworker (rw=0x55d8873ca590) at postmaster.c:5839
#30 0x000055d88646038f in maybe_start_bgworkers () at postmaster.c:6065
#31 0x000055d88645f265 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5234
#32 <signal handler called>
#33 0x00007f64684c956b in __GI___select (nfds=7, readfds=0x7ffc6e8060e0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e806050) at ../sysdeps/unix/sysv/linux/select.c:41
#34 0x000055d88645aa3c in ServerLoop () at postmaster.c:1691
#35 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400
#36 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210

So the only difference between the two remaining worker processes is
frame #7: 14082 got to ExecParallelHashIncreaseNumBatches through
ExecParallelHashTuplePrealloc, while 14083 got there through
ExecParallelHashTupleAlloc.
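
Either way they end up at the same BarrierArriveAndWait() call in frame
#5, on the same barrier (both backtraces show barrier=0x7f63602a05f4).
If I understand barrier.c correctly, that wait only finishes once every
participant still attached to the barrier has either arrived or detached,
roughly this usage pattern (illustration only, with made-up names, not
the actual nodeHash.c code):

    /* Hypothetical illustration of the barrier.c rendezvous; the
     * variable and wait-event names here are made up. */
    BarrierAttach(barrier);

    /* ... do this participant's share of the work ... */

    /*
     * Sleep until every attached participant has arrived (or detached).
     * Exactly one backend returns true and is "elected" to perform the
     * serial work for the next phase.
     */
    if (BarrierArriveAndWait(barrier, WAIT_EVENT_SOME_PHASE))
    {
        /* serial work for the elected participant */
    }

    BarrierDetach(barrier);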

I'll leave the system in this "stuck" state in case we need to do
further investigation.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
