BUG #15585: infinite DynamicSharedMemoryControlLock waiting in parallel query

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: sk(at)zsrv(dot)org
Subject: BUG #15585: infinite DynamicSharedMemoryControlLock waiting in parallel query
Date: 2019-01-10 09:21:24
Message-ID: 15585-324ff6a93a18da46@postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 15585
Logged by: Sergei Kornilov
Email address: sk(at)zsrv(dot)org
PostgreSQL version: 10.6
Operating system: Ubuntu 16.04.2 LTS
Description:

Hello,
I have encountered an issue with a parallel query that hangs indefinitely:
many processes (a client backend and several background workers) are
waiting on the LWLock DynamicSharedMemoryControlLock, and one client
backend has been waiting on IPC BgWorkerShutdown forever (for hours) and is
unkillable. I first spotted this issue on 1 January on version 10.5; I then
shut the database down (immediate mode, since a clean shutdown was not
possible), installed gdb and the dbg packages, and upgraded to 10.6.
Yesterday the issue occurred again and I saved some backtraces (the query
had been hanging for about half an hour). It seems somewhat similar to
BUG #15290 [1], but this server was already on 10.5 and a different LWLock
is involved.

Parallel leader:
datid | 16400
datname | db
pid | 28927
usesysid | 852050794
usename | user
application_name | ...
client_addr | 127.0.0.1
client_hostname |
client_port | 58326
backend_start | 2019-01-09 10:54:02.50649+00
xact_start | 2019-01-09 10:56:55.802265+00
query_start | 2019-01-09 10:56:55.80524+00
state_change | 2019-01-09 10:56:55.805243+00
wait_event_type | IPC
wait_event | BgWorkerShutdown
state | active
backend_xid |
backend_xmin | 2047764791
query | SELECT ...
backend_type | client backend

#0 0x00007f198e58b9f3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
#1 0x0000555d7fe393d1 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7ffe64b3c320, cur_timeout=-1, set=0x555d8108fec8) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/storage/ipc/latch.c:1048
#2 WaitEventSetWait (set=set(at)entry=0x555d8108fec8,
timeout=timeout(at)entry=-1,
occurred_events=occurred_events(at)entry=0x7ffe64b3c320,
nevents=nevents(at)entry=1, wait_event_info=wait_event_info(at)entry=134217728)
at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/storage/ipc/latch.c:1000
#3 0x0000555d7fe39844 in WaitLatchOrSocket (latch=0x7f15ab5a80b4,
wakeEvents=wakeEvents(at)entry=17, sock=sock(at)entry=-1, timeout=-1,
timeout(at)entry=0, wait_event_info=wait_event_info(at)entry=134217728)
at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/storage/ipc/latch.c:385
#4 0x0000555d7fe398f5 in WaitLatch (latch=<optimized out>,
wakeEvents=wakeEvents(at)entry=17, timeout=timeout(at)entry=0,
wait_event_info=wait_event_info(at)entry=134217728)
at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/storage/ipc/latch.c:339
#5 0x0000555d7fddce3b in WaitForBackgroundWorkerShutdown
(handle=0x555d810b2db0) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/postmaster/bgworker.c:1147
#6 0x0000555d7fc0211d in WaitForParallelWorkersToExit (pcxt=0x555d810b2938,
pcxt=0x555d810b2938) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/access/transam/parallel.c:659
#7 0x0000555d7fc02e47 in DestroyParallelContext (pcxt=0x555d810b2938) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/access/transam/parallel.c:741
#8 0x0000555d7fc03458 in AtEOXact_Parallel (isCommit=isCommit(at)entry=0
'\000') at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/access/transam/parallel.c:1010
#9 0x0000555d7fc0e807 in AbortTransaction () at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/access/transam/xact.c:2566
#10 0x0000555d7fc0ef65 in AbortCurrentTransaction () at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/access/transam/xact.c:3125
#11 0x0000555d7fe5c7cd in PostgresMain (argc=1,
argv=argv(at)entry=0x555d81009a08, dbname=0x555d810370f8 "shard_sw_production",
username=0x555d81037128 "startwire_background")
at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/tcop/postgres.c:3908
#12 0x0000555d7fb93509 in BackendRun (port=0x555d81032d80) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:4405
#13 BackendStartup (port=0x555d81032d80) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:4077
#14 ServerLoop () at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:1755
#15 0x0000555d7fdeb6bb in PostmasterMain (argc=5, argv=<optimized out>) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:1363
#16 0x0000555d7fb94905 in main (argc=5, argv=0x555d81007850) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/main/main.c:228

Another blocked process:

datid | 16400
datname | db
pid | 22070
usesysid | 852050794
usename | user
application_name | ...
client_addr | 127.0.0.1
client_hostname |
client_port | 57762
backend_start | 2019-01-09 10:16:45.773862+00
xact_start | 2019-01-09 10:57:38.902604+00
query_start | 2019-01-09 10:57:38.90572+00
state_change | 2019-01-09 10:57:38.905724+00
wait_event_type | LWLock
wait_event | DynamicSharedMemoryControlLock
state | active
backend_xid |
backend_xmin | 2047768195
query | SELECT ...
backend_type | client backend

#0 0x00007f199087a827 in futex_abstimed_wait_cancelable (private=128,
abstime=0x0, expected=0, futex_word=0x7f11d620c438) at
../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem(at)entry=0x7f11d620c438, abstime=0x0) at
sem_waitcommon.c:111
#2 0x00007f199087a8d4 in __new_sem_wait_slow (sem=0x7f11d620c438,
abstime=0x0) at sem_waitcommon.c:181
#3 0x00007f199087a97a in __new_sem_wait (sem=sem(at)entry=0x7f11d620c438) at
sem_wait.c:29
#4 0x0000555d7fdd9782 in PGSemaphoreLock (sema=0x7f11d620c438) at
pg_sema.c:316
#5 0x0000555d7fe4ca0c in LWLockAcquire (lock=0x7f11d620e600,
mode=mode(at)entry=LW_EXCLUSIVE) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/storage/lmgr/lwlock.c:1233
#6 0x0000555d7fe37717 in dsm_create (size=size(at)entry=380920,
flags=flags(at)entry=1) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/storage/ipc/dsm.c:493
#7 0x0000555d7fc0248f in InitializeParallelDSM
(pcxt=pcxt(at)entry=0x555d81388958) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/access/transam/parallel.c:268
#8 0x0000555d7fd266c7 in ExecInitParallelPlan (planstate=0x555d81105440,
estate=0x555d811047e8, nworkers=4) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/executor/execParallel.c:470
#9 0x0000555d7fd36260 in ExecGather (pstate=0x555d81105120) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/executor/nodeGather.c:158
#10 0x0000555d7fd2e70c in ExecProcNode (node=0x555d81105120) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/include/executor/executor.h:250
#11 fetch_input_tuple (aggstate=aggstate(at)entry=0x555d81104a78) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/executor/nodeAgg.c:695
#12 0x0000555d7fd309cb in agg_retrieve_direct (aggstate=0x555d81104a78) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/executor/nodeAgg.c:2347
#13 ExecAgg (pstate=0x555d81104a78) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/executor/nodeAgg.c:2158
#14 0x0000555d7fd2250b in ExecProcNode (node=0x555d81104a78) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/include/executor/executor.h:250
#15 ExecutePlan (execute_once=<optimized out>, dest=0x555d810b5fc8,
direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>,
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x555d81104a78,
estate=0x555d811047e8) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/executor/execMain.c:1723
#16 standard_ExecutorRun (queryDesc=0x555d81111338, direction=<optimized
out>, count=0, execute_once=<optimized out>) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/executor/execMain.c:364
#17 0x00007f1987338ea5 in pgss_ExecutorRun (queryDesc=0x555d81111338,
direction=ForwardScanDirection, count=0, execute_once=<optimized out>)
at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../contrib/pg_stat_statements/pg_stat_statements.c:889
#18 0x0000555d7fe5ebfb in PortalRunSelect
(portal=portal(at)entry=0x555d81026d48, forward=forward(at)entry=1 '\001',
count=0, count(at)entry=9223372036854775807, dest=dest(at)entry=0x555d810b5fc8)
at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/tcop/pquery.c:932
#19 0x0000555d7fe60210 in PortalRun (portal=portal(at)entry=0x555d81026d48,
count=count(at)entry=9223372036854775807, isTopLevel=isTopLevel(at)entry=1 '\001',
run_once=<optimized out>, dest=dest(at)entry=0x555d810b5fc8,
altdest=altdest(at)entry=0x555d810b5fc8, completionTag=0x7ffe64b3c780 "")
at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/tcop/pquery.c:773
#20 0x0000555d7fe5cd89 in exec_execute_message
(max_rows=9223372036854775807, portal_name=0x555d810b5bb8 "") at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/tcop/postgres.c:2007
#21 PostgresMain (argc=<optimized out>, argv=argv(at)entry=0x555d81009a08,
dbname=0x555d810370f8 "shard_sw_production", username=<optimized out>)
at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/tcop/postgres.c:4180
#22 0x0000555d7fb93509 in BackendRun (port=0x555d81033e30) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:4405
#23 BackendStartup (port=0x555d81033e30) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:4077
#24 ServerLoop () at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:1755
#25 0x0000555d7fdeb6bb in PostmasterMain (argc=5, argv=<optimized out>) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/postmaster/postmaster.c:1363
#26 0x0000555d7fb94905 in main (argc=5, argv=0x555d81007850) at
/build/postgresql-10-4UiiBA/postgresql-10-10.6/build/../src/backend/main/main.c:228

I also found relevant (and strange) messages in the logs:

2019-01-09 10:56:55.947 UTC 31156 @ from [vxid:16/1632 txid:0] [] ERROR:
dsa_area could not attach to segment
2019-01-09 10:56:55.947 UTC 31156 @ from [vxid:16/1632 txid:0] []
STATEMENT: SELECT ...
2019-01-09 10:56:55.950 UTC 14433 @ from [vxid: txid:0] [] LOG: worker
process: parallel worker for PID 28927 (PID 31156) exited with exit code 1
2019-01-09 10:56:55.950 UTC 28927 user(at)db from 127.0.0.1 [vxid:102/913
txid:0] [SELECT] ERROR: dsa_area could not attach to segment
2019-01-09 10:56:55.950 UTC 28927 user(at)db from 127.0.0.1 [vxid:102/913
txid:0] [SELECT] STATEMENT: SELECT ...
2019-01-09 10:56:55.954 UTC 31157 @ from [vxid:32/1740 txid:0] [] FATAL:
terminating connection due to administrator command
2019-01-09 10:56:55.954 UTC 31157 @ from [vxid:32/1740 txid:0] []
STATEMENT: SELECT ...
2019-01-09 10:56:55.954 UTC 31157 @ from [vxid:32/0 txid:0] [] FATAL:
cannot unpin a segment that is not pinned

log_line_prefix = "%m %p %u(at)%d from %h [vxid:%v txid:%x] [%i]"; there are
no other log messages within roughly +/- 5 minutes of these.

PID 28927 is the parallel leader waiting on BgWorkerShutdown, 31156 is a
parallel worker for this query, and 31157 seems to be another parallel
worker for the same query. Unfortunately I noticed that one too late and
have no backtrace for it, but I do have its pg_stat_activity row:

datid | 16400
datname | db
pid | 22070
usesysid | 852050794
usename | user
application_name | ...
client_addr |
client_hostname |
client_port |
backend_start | 2019-01-09 10:56:55.839719+00
xact_start |
query_start | 2019-01-09 10:56:55.80524+00
state_change | 2019-01-09 10:56:55.840737+00
wait_event_type | LWLock
wait_event | DynamicSharedMemoryControlLock
state | active
backend_xid |
backend_xmin |
query | SELECT ... (same query as stuck leader)
backend_type | background worker

PostgreSQL is installed from the PGDG repo; no external modules are
installed. The queries use one native partitioned table with a Parallel
Index Scan or Parallel Bitmap Heap Scan. The query itself normally works
fine, and I cannot reproduce the issue by hand.
The same log output, "dsa_area could not attach to segment" plus "cannot
unpin a segment that is not pinned", also appeared on 1 January.

Also, is it expected that WaitForBackgroundWorkerShutdown does
CHECK_FOR_INTERRUPTS, given that its only call path here is from
DestroyParallelContext, which holds interrupts around the wait:
/*
 * We can't finish transaction commit or abort until all of the workers
 * have exited. This means, in particular, that we can't respond to
 * interrupts at this stage.
 */
HOLD_INTERRUPTS();
WaitForParallelWorkersToExit(pcxt);
RESUME_INTERRUPTS();
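
If I read this correctly, while interrupts are held the
CHECK_FOR_INTERRUPTS() inside WaitForBackgroundWorkerShutdown cannot
actually do anything, which would explain why the leader is unkillable
while it waits. Below is a minimal, self-contained sketch of the holdoff
behaviour as I understand it from miscadmin.h; it is a simplification for
illustration, not the real PostgreSQL definitions:

/*
 * Simplified sketch of the interrupt-holdoff mechanism; not the actual
 * PostgreSQL code, just an illustration of why CHECK_FOR_INTERRUPTS() is
 * inert between HOLD_INTERRUPTS() and RESUME_INTERRUPTS().
 */
#include <stdbool.h>
#include <stdio.h>

static volatile bool InterruptPending = true;  /* e.g. a terminate request arrived */
static volatile int  InterruptHoldoffCount = 0;

#define HOLD_INTERRUPTS()   (InterruptHoldoffCount++)
#define RESUME_INTERRUPTS() (InterruptHoldoffCount--)

static void
ProcessInterrupts(void)
{
    if (InterruptHoldoffCount != 0)
        return;                 /* holdoff active: interrupt stays pending */
    printf("interrupt processed\n");
    InterruptPending = false;
}

#define CHECK_FOR_INTERRUPTS() \
    do { if (InterruptPending) ProcessInterrupts(); } while (0)

int
main(void)
{
    HOLD_INTERRUPTS();          /* as in DestroyParallelContext */
    CHECK_FOR_INTERRUPTS();     /* as in WaitForBackgroundWorkerShutdown: no-op */
    printf("pending while held: %d\n", (int) InterruptPending);
    RESUME_INTERRUPTS();
    CHECK_FOR_INTERRUPTS();     /* only now can the pending interrupt be handled */
    return 0;
}

So, if this reading is right, the leader stuck in
WaitForParallelWorkersToExit cannot react to cancel or terminate requests
until the worker actually exits, which matches what I observe.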

Can a parallel worker be waiting for some response from the leader?

thanks in advance!

regards, Sergei

[1]
https://www.postgresql.org/message-id/flat/153228422922.1395.1746424054206154747%40wrigleys.postgresql.org
