parallel queries hang

From: Euler Taveira <euler(at)timbira(dot)com(dot)br>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: parallel queries hang
Date: 2018-10-18 14:42:31
Message-ID: CAHE3wggOccFrWsNTnMQqPDDc1C=J=oqkoupDdJmqFyiHy58npg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

Parallel queries hang and can't terminate backends (SIGTERM). It seems
the same (or related) case as reported in [1]. It is a postgres 10.4
on RHEL 6.9. It is not the first time it happens (but it is the first
time I got stack traces). I have two cases today.

The first case is waiting on IPC (wait_event_type) for BtreePage (wait_event).

postgres 15785 0.0 0.0 6670688 733652 ? Ss Oct17 0:02 \_
postgres: usuario prod 10.0.0.2(38065) SELECT
postgres 16664 0.0 0.0 6667696 747816 ? Ss Oct17 0:02 \_
postgres: bgworker: parallel worker for PID 15785

The stack trace for PID 15785:

(gdb) bt
#0 0x00000030ac6e91a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00000000006f6b65 in WaitEventSetWaitBlock (set=0xdcc680,
timeout=-1, occurred_events=0x7ffd0ceed320, nevents=1,
wait_event_info=<value optimized out>) at latch.c:1048
#2 WaitEventSetWait (set=0xdcc680, timeout=-1,
occurred_events=0x7ffd0ceed320, nevents=1, wait_event_info=<value
optimized out>) at latch.c:1000
#3 0x000000000070f207 in ConditionVariableSleep (cv=<value optimized
out>, wait_event_info=134217730) at condition_variable.c:149
#4 0x00000000004bffe1 in _bt_parallel_seize (scan=<value optimized
out>, pageno=0x7ffd0ceed3ec) at nbtree.c:811
#5 0x00000000004c1ffe in _bt_readnextpage (scan=0xfe2da8, blkno=0,
dir=ForwardScanDirection) at nbtsearch.c:1498
#6 0x00000000004c21fe in _bt_steppage (scan=0xfe2da8,
dir=ForwardScanDirection) at nbtsearch.c:1429
#7 0x00000000004c2384 in _bt_next (scan=0xfe2da8, dir=<value
optimized out>) at nbtsearch.c:1150
#8 0x00000000004c06fc in btgettuple (scan=0xfe2da8,
dir=ForwardScanDirection) at nbtree.c:399
#9 0x00000000004b9c9a in index_getnext_tid (scan=0xfe2da8,
direction=<value optimized out>) at indexam.c:541
#10 0x00000000004b9d4a in index_getnext (scan=0xfe2da8,
direction=ForwardScanDirection) at indexam.c:679
#11 0x0000000000606b89 in IndexNext (node=0xf07f58) at nodeIndexscan.c:134
#12 0x00000000005f5ee7 in ExecScanFetch (node=0xf07f58,
accessMtd=0x606b20 <IndexNext>, recheckMtd=0x606ac0 <IndexRecheck>) at
execScan.c:97
#13 ExecScan (node=0xf07f58, accessMtd=0x606b20 <IndexNext>,
recheckMtd=0x606ac0 <IndexRecheck>) at execScan.c:147
#14 0x000000000060eccc in ExecProcNode (pstate=0xf07d98) at
../../../src/include/executor/executor.h:250
#15 ExecNestLoop (pstate=0xf07d98) at nodeNestloop.c:109
#16 0x000000000060eccc in ExecProcNode (pstate=0xf07bd8) at
../../../src/include/executor/executor.h:250
#17 ExecNestLoop (pstate=0xf07bd8) at nodeNestloop.c:109
#18 0x0000000000602cff in ExecProcNode (pstate=0xf07858) at
../../../src/include/executor/executor.h:250
#19 gather_getnext (pstate=0xf07858) at nodeGather.c:286
#20 ExecGather (pstate=0xf07858) at nodeGather.c:215
#21 0x00000000005f333f in ExecProcNode (queryDesc=0xe1d578,
direction=<value optimized out>, count=0, execute_once=88 'X')
at ../../../src/include/executor/executor.h:250
#22 ExecutePlan (queryDesc=0xe1d578, direction=<value optimized out>,
count=0, execute_once=88 'X') at execMain.c:1722
#23 standard_ExecutorRun (queryDesc=0xe1d578, direction=<value
optimized out>, count=0, execute_once=88 'X') at execMain.c:363
#24 0x000000000071a77b in PortalRunSelect (portal=0xe320d8,
forward=<value optimized out>, count=0, dest=<value optimized out>) at
pquery.c:932
#25 0x000000000071b8e1 in PortalRun (portal=0xe320d8,
count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001',
dest=0xf746f0,
altdest=0xf746f0, completionTag=0x7ffd0ceed950 "") at pquery.c:773
#26 0x0000000000718097 in exec_simple_query (
query_string=0xdb0c18 "query goes here"...) at postgres.c:1099
#27 0x0000000000718fd9 in PostgresMain (argc=<value optimized out>,
argv=<value optimized out>, dbname=0xd076d8 "prod",
username=<value optimized out>) at postgres.c:4088
#28 0x00000000006ae04d in BackendRun (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:4405
#29 BackendStartup (argc=<value optimized out>, argv=<value optimized
out>) at postmaster.c:4077
#30 ServerLoop (argc=<value optimized out>, argv=<value optimized
out>) at postmaster.c:1755
#31 PostmasterMain (argc=<value optimized out>, argv=<value optimized
out>) at postmaster.c:1363
#32 0x000000000062f050 in main (argc=3, argv=0xd05540) at main.c:228

The stack trace for PID 16664:

(gdb) bt
#0 0x00000030ac6e91a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00000000006f6b65 in WaitEventSetWaitBlock (set=0xde4598,
timeout=-1, occurred_events=0x7ffd0ceecd90, nevents=1,
wait_event_info=<value optimized out>) at latch.c:1048
#2 WaitEventSetWait (set=0xde4598, timeout=-1,
occurred_events=0x7ffd0ceecd90, nevents=1, wait_event_info=<value
optimized out>) at latch.c:1000
#3 0x000000000070f207 in ConditionVariableSleep (cv=<value optimized
out>, wait_event_info=134217730) at condition_variable.c:149
#4 0x00000000004bffe1 in _bt_parallel_seize (scan=<value optimized
out>, pageno=0x7ffd0ceece5c) at nbtree.c:811
#5 0x00000000004c1ffe in _bt_readnextpage (scan=0xe0c8c0, blkno=0,
dir=ForwardScanDirection) at nbtsearch.c:1498
#6 0x00000000004c21fe in _bt_steppage (scan=0xe0c8c0,
dir=ForwardScanDirection) at nbtsearch.c:1429
#7 0x00000000004c2384 in _bt_next (scan=0xe0c8c0, dir=<value
optimized out>) at nbtsearch.c:1150
#8 0x00000000004c06fc in btgettuple (scan=0xe0c8c0,
dir=ForwardScanDirection) at nbtree.c:399
#9 0x00000000004b9c9a in index_getnext_tid (scan=0xe0c8c0,
direction=<value optimized out>) at indexam.c:541
#10 0x00000000004b9d4a in index_getnext (scan=0xe0c8c0,
direction=ForwardScanDirection) at indexam.c:679
#11 0x0000000000606b89 in IndexNext (node=0xddefe8) at nodeIndexscan.c:134
#12 0x00000000005f5ee7 in ExecScanFetch (node=0xddefe8,
accessMtd=0x606b20 <IndexNext>, recheckMtd=0x606ac0 <IndexRecheck>) at
execScan.c:97
#13 ExecScan (node=0xddefe8, accessMtd=0x606b20 <IndexNext>,
recheckMtd=0x606ac0 <IndexRecheck>) at execScan.c:147
#14 0x000000000060eccc in ExecProcNode (pstate=0xdded28) at
../../../src/include/executor/executor.h:250
#15 ExecNestLoop (pstate=0xdded28) at nodeNestloop.c:109
#16 0x000000000060eccc in ExecProcNode (pstate=0xddee98) at
../../../src/include/executor/executor.h:250
#17 ExecNestLoop (pstate=0xddee98) at nodeNestloop.c:109
#18 0x00000000005f333f in ExecProcNode (queryDesc=0xe0c7a0,
direction=<value optimized out>, count=0, execute_once=-104 '\230')
at ../../../src/include/executor/executor.h:250
#19 ExecutePlan (queryDesc=0xe0c7a0, direction=<value optimized out>,
count=0, execute_once=-104 '\230') at execMain.c:1722
#20 standard_ExecutorRun (queryDesc=0xe0c7a0, direction=<value
optimized out>, count=0, execute_once=-104 '\230') at execMain.c:363
#21 0x00000000005f3809 in ParallelQueryMain (seg=<value optimized
out>, toc=0x7f80a66e3000) at execParallel.c:997
#22 0x00000000004dac49 in ParallelWorkerMain (main_arg=<value
optimized out>) at parallel.c:1203
#23 0x000000000069f8f0 in StartBackgroundWorker () at bgworker.c:835
#24 0x00000000006a8aec in do_start_bgworker () at postmaster.c:5728
#25 maybe_start_bgworkers () at postmaster.c:5941
#26 0x00000000006ac115 in sigusr1_handler (postgres_signal_arg=<value
optimized out>) at postmaster.c:5121
#27 <signal handler called>
#28 0x00000030ac6e1603 in __select_nocancel () from /lib64/libc.so.6
#29 0x00000000006ad65d in ServerLoop (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:1719
#30 PostmasterMain (argc=<value optimized out>, argv=<value optimized
out>) at postmaster.c:1363
#31 0x000000000062f050 in main (argc=3, argv=0xd05540) at main.c:228

The second case is waiting on IPC (wait_event_type) for
BgWorkerShutdown (wait_event):

postgres 16811 0.0 0.0 6671376 493104 ? Ss Oct17 0:01 \_
postgres: ana rdnet 10.0.32.200(38793) SELECT
postgres 17877 0.0 0.0 6667696 311972 ? Ss Oct17 0:00 \_
postgres: bgworker: parallel worker for PID 16811

The stack trace for PID 16811:

(gdb) bt
#0 0x00000030ac6e91a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00000000006f6b65 in WaitEventSetWaitBlock (set=0xda2fa8,
timeout=-1, occurred_events=0x7ffd0ceed8e0, nevents=1,
wait_event_info=<value optimized out>) at latch.c:1048
#2 WaitEventSetWait (set=0xda2fa8, timeout=-1,
occurred_events=0x7ffd0ceed8e0, nevents=1, wait_event_info=<value
optimized out>) at latch.c:1000
#3 0x00000000006f73c9 in WaitLatchOrSocket (latch=0x7f82325ab0b4,
wakeEvents=<value optimized out>, sock=-1, timeout=-1,
wait_event_info=134217728)
at latch.c:385
#4 0x000000000069eff8 in WaitForBackgroundWorkerShutdown
(handle=0xd24f70) at bgworker.c:1154
#5 0x00000000004db11e in WaitForParallelWorkersToExit (pcxt=0xd24c78)
at parallel.c:655
#6 0x00000000004db248 in DestroyParallelContext (pcxt=0xd24c78) at
parallel.c:737
#7 0x00000000004db2e3 in AtEOXact_Parallel (isCommit=0 '\000') at
parallel.c:1006
#8 0x00000000004e4537 in AbortTransaction () at xact.c:2538
#9 0x00000000004e4c05 in AbortCurrentTransaction () at xact.c:3046
#10 0x000000000071971d in PostgresMain (argc=1, argv=<value optimized
out>, dbname=0xd076d8 "rdnet", username=<value optimized out>)
at postgres.c:3879
#11 0x00000000006ae04d in BackendRun (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:4405
#12 BackendStartup (argc=<value optimized out>, argv=<value optimized
out>) at postmaster.c:4077
#13 ServerLoop (argc=<value optimized out>, argv=<value optimized
out>) at postmaster.c:1755
#14 PostmasterMain (argc=<value optimized out>, argv=<value optimized
out>) at postmaster.c:1363
#15 0x000000000062f050 in main (argc=3, argv=0xd05540) at main.c:228

The stack trace for PID 17877:

(gdb) bt
#0 0x00000030ac6e91a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00000000006f6b65 in WaitEventSetWaitBlock (set=0xde4968,
timeout=-1, occurred_events=0x7ffd0ceecd90, nevents=1,
wait_event_info=<value optimized out>) at latch.c:1048
#2 WaitEventSetWait (set=0xde4968, timeout=-1,
occurred_events=0x7ffd0ceecd90, nevents=1, wait_event_info=<value
optimized out>) at latch.c:1000
#3 0x000000000070f207 in ConditionVariableSleep (cv=<value optimized
out>, wait_event_info=134217730) at condition_variable.c:149
#4 0x00000000004bffe1 in _bt_parallel_seize (scan=<value optimized
out>, pageno=0x7ffd0ceece5c) at nbtree.c:811
#5 0x00000000004c1ffe in _bt_readnextpage (scan=0xe0cc90, blkno=0,
dir=ForwardScanDirection) at nbtsearch.c:1498
#6 0x00000000004c21fe in _bt_steppage (scan=0xe0cc90,
dir=ForwardScanDirection) at nbtsearch.c:1429
#7 0x00000000004c2384 in _bt_next (scan=0xe0cc90, dir=<value
optimized out>) at nbtsearch.c:1150
#8 0x00000000004c06fc in btgettuple (scan=0xe0cc90,
dir=ForwardScanDirection) at nbtree.c:399
#9 0x00000000004b9c9a in index_getnext_tid (scan=0xe0cc90,
direction=<value optimized out>) at indexam.c:541
#10 0x00000000004b9d4a in index_getnext (scan=0xe0cc90,
direction=ForwardScanDirection) at indexam.c:679
#11 0x0000000000606b89 in IndexNext (node=0xddf3b8) at nodeIndexscan.c:134
#12 0x00000000005f5ee7 in ExecScanFetch (node=0xddf3b8,
accessMtd=0x606b20 <IndexNext>, recheckMtd=0x606ac0 <IndexRecheck>) at
execScan.c:97
#13 ExecScan (node=0xddf3b8, accessMtd=0x606b20 <IndexNext>,
recheckMtd=0x606ac0 <IndexRecheck>) at execScan.c:147
#14 0x000000000060eccc in ExecProcNode (pstate=0xddf0f8) at
../../../src/include/executor/executor.h:250
#15 ExecNestLoop (pstate=0xddf0f8) at nodeNestloop.c:109
#16 0x000000000060eccc in ExecProcNode (pstate=0xddf268) at
../../../src/include/executor/executor.h:250
#17 ExecNestLoop (pstate=0xddf268) at nodeNestloop.c:109
#18 0x00000000005f333f in ExecProcNode (queryDesc=0xe0cb70,
direction=<value optimized out>, count=0, execute_once=104 'h')
at ../../../src/include/executor/executor.h:250
#19 ExecutePlan (queryDesc=0xe0cb70, direction=<value optimized out>,
count=0, execute_once=104 'h') at execMain.c:1722
#20 standard_ExecutorRun (queryDesc=0xe0cb70, direction=<value
optimized out>, count=0, execute_once=104 'h') at execMain.c:363
#21 0x00000000005f3809 in ParallelQueryMain (seg=<value optimized
out>, toc=0x7f80a66e3000) at execParallel.c:997
#22 0x00000000004dac49 in ParallelWorkerMain (main_arg=<value
optimized out>) at parallel.c:1203
#23 0x000000000069f8f0 in StartBackgroundWorker () at bgworker.c:835
#24 0x00000000006a8aec in do_start_bgworker () at postmaster.c:5728
#25 maybe_start_bgworkers () at postmaster.c:5941
#26 0x00000000006ac115 in sigusr1_handler (postgres_signal_arg=<value
optimized out>) at postmaster.c:5121
#27 <signal handler called>
#28 0x00000030ac6e1603 in __select_nocancel () from /lib64/libc.so.6
#29 0x00000000006ad65d in ServerLoop (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:1719
#30 PostmasterMain (argc=<value optimized out>, argv=<value optimized
out>) at postmaster.c:1363
#31 0x000000000062f050 in main (argc=3, argv=0xd05540) at main.c:228

The non-cancellable queries is not an expected behavior because you
have to shutdown (immediate) postgres.

[1] https://www.postgresql.org/message-id/151724453314.1238.409882538067070269%40wrigleys.postgresql.org

--
Euler Taveira Timbira -
http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Sergei Kornilov 2018-10-18 14:51:26 Re: parallel queries hang
Previous Message PG Bug reporting form 2018-10-18 11:57:41 BUG #15437: Segfault during insert into declarative partitioned table with a trigger creating partition