Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION
Date: 2017-10-24 17:13:28
Message-ID: eb6d6bac-1fda-2adc-7138-e8c11ee145ee@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Parallel execution of ALTER SUBSCRIPTION REFRESH PUBLICATION at several
nodes may cause deadlock:

knizhnik  1480  0.0  0.1 1417532 16496 ?       Ss   20:01   0:00
postgres: bgworker: logical replication worker for subscription 16589
sync 16720    waiting
knizhnik  1481  0.0  0.1 1417668 17668 ?       Ss   20:01   0:00
postgres: knizhnik postgres 127.0.0.1(36378) idle
knizhnik  1484  0.0  0.1 1406952 16676 ?       Ss   20:01   0:00
postgres: knizhnik postgres 127.0.0.1(56194) ALTER SUBSCRIPTION waiting
knizhnik  1486  0.0  0.1 1410040 21268 ?       Ss   20:01   0:00
postgres: wal sender process knizhnik 127.0.0.1(36386) idle
knizhnik  1487  0.0  0.1 1417532 16736 ?       Ss   20:01   0:00
postgres: bgworker: logical replication worker for subscription 16589
sync 16774
knizhnik  1489  0.0  0.1 1410040 21336 ?       Ss   20:01   0:00
postgres: wal sender process knizhnik 127.0.0.1(36388) idle
knizhnik  1510  0.0  0.1 1406952 16820 ?       Ss   20:01   0:00
postgres: knizhnik postgres 127.0.0.1(56228) ALTER SUBSCRIPTION waiting
knizhnik  1530  0.0  0.1 1406952 16736 ?       Ss   20:01   0:00
postgres: knizhnik postgres 127.0.0.1(56260) ALTER SUBSCRIPTION waiting
knizhnik  1534  0.0  0.0 1417504 14360 ?       Ss   20:01   0:00
postgres: bgworker: logical replication worker for subscription 16590
knizhnik  1537  0.0  0.1 1409164 16920 ?       Ss   20:01   0:00
postgres: wal sender process knizhnik 127.0.0.1(45054) idle
knizhnik  1545  0.0  0.0 1417344 14576 ?       Ss   20:01   0:00
postgres: bgworker: logical replication worker for subscription 16592
knizhnik  1546  0.0  0.1 1409168 16588 ?       Ss   20:01   0:00
postgres: wal sender process knizhnik 127.0.0.1(56274) idle
knizhnik  1547  0.0  0.0 1417348 14332 ?       Ss   20:01   0:00
postgres: bgworker: logical replication worker for subscription 16592
sync 16606
knizhnik  1549  0.0  0.0 1409004 14512 ?       Ss   20:01   0:00
postgres: wal sender process knizhnik 127.0.0.1(56276) idle in
transaction waiting
knizhnik  1553  0.0  0.0 1417348 14540 ?       Ss   20:01   0:00
postgres: bgworker: logical replication worker for subscription 16592
knizhnik  1554  0.0  0.1 1409168 16596 ?       Ss   20:01   0:00
postgres: wal sender process knizhnik 127.0.0.1(56280) idle
knizhnik  1555  0.0  0.0 1417348 14332 ?       Ss   20:01   0:00
postgres: bgworker: logical replication worker for subscription 16592
sync 16660
knizhnik  1556  0.0  0.0 1409004 14520 ?       Ss   20:01   0:00
postgres: wal sender process knizhnik 127.0.0.1(56282) idle in
transaction waiting
knizhnik  1558  0.0  0.0 1417348 14460 ?       Ss   20:01   0:00
postgres: bgworker: logical replication worker for subscription 16592
sync 16696
knizhnik  1560  0.0  0.0 1409004 14516 ?       Ss   20:01   0:00
postgres: wal sender process knizhnik 127.0.0.1(56286) idle in
transaction waiting
knizhnik  1562  0.0  0.0 1417348 14084 ?       Ss   20:01   0:00
postgres: bgworker: logical replication worker for subscription 16592
sync 16732
knizhnik  1567  0.0  0.0 1409004 14516 ?       Ss   20:01   0:00
postgres: wal sender process knizhnik 127.0.0.1(56288) idle in
transaction waiting

knizhnik(at)knizhnik:~$ gdb -p 1556
(gdb) bt
#0 0x00007f7b991569b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1 0x000000000087185a in WaitEventSetWaitBlock (set=0x246d1e8, cur_timeout=-1, occurred_events=0x7ffff3a43580, nevents=1) at latch.c:1048
#2 0x0000000000871716 in WaitEventSetWait (set=0x246d1e8, timeout=-1, occurred_events=0x7ffff3a43580, nevents=1, wait_event_info=50331652) at latch.c:1000
#3 0x0000000000870e6c in WaitLatchOrSocket (latch=0x7f7b975daba4, wakeEvents=1, sock=-1, timeout=-1, wait_event_info=50331652) at latch.c:385
#4 0x0000000000870d3e in WaitLatch (latch=0x7f7b975daba4, wakeEvents=1, timeout=0, wait_event_info=50331652) at latch.c:339
#5 0x000000000088995c in ProcSleep (locallock=0x24764d0, lockMethodTable=0xc17420 <default_lockmethod>) at proc.c:1255
#6 0x00000000008828cb in WaitOnLock (locallock=0x24764d0, owner=0x2517c78) at lock.c:1702
#7 0x000000000088157e in LockAcquireExtended (locktag=0x7ffff3a439a0, lockmode=5, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1 '\001') at lock.c:998
#8 0x0000000000880ba8 in LockAcquire (locktag=0x7ffff3a439a0, lockmode=5, sessionLock=0 '\000', dontWait=0 '\000') at lock.c:688
#9 0x000000000087f8dc in XactLockTableWait (xid=624, rel=0x0, ctid=0x0, oper=XLTW_None) at lmgr.c:587
#10 0x000000000082f524 in SnapBuildWaitSnapshot (running=0x246d1b0, cutoff=632) at snapbuild.c:1400
#11 0x000000000082f2f6 in SnapBuildFindSnapshot (builder=0x254b550, lsn=24410008, running=0x246d1b0) at snapbuild.c:1311
#12 0x000000000082ee67 in SnapBuildProcessRunningXacts (builder=0x254b550, lsn=24410008, running=0x246d1b0) at snapbuild.c:1106
#13 0x000000000081c458 in DecodeStandbyOp (ctx=0x25334b0, buf=0x7ffff3a43b10) at decode.c:314
#14 0x000000000081bf39 in LogicalDecodingProcessRecord (ctx=0x25334b0, record=0x2533770) at decode.c:117
#15 0x0000000000820219 in DecodingContextFindStartpoint (ctx=0x25334b0) at logical.c:458
#16 0x000000000083674a in CreateReplicationSlot (cmd=0x246c418) at walsender.c:934
#17 0x00000000008374ef in exec_replication_command (cmd_string=0x24dd140 "CREATE_REPLICATION_SLOT \"node_1_16592_sync_16660\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT") at walsender.c:1511
#18 0x00000000008a47ff in PostgresMain (argc=1, argv=0x247c468, dbname=0x247c2a8 "postgres", username=0x244f870 "knizhnik") at postgres.c:4086
#19 0x0000000000801753 in BackendRun (port=0x2477db0) at postmaster.c:4357
#20 0x0000000000800e5f in BackendStartup (port=0x2477db0) at postmaster.c:4029
#21 0x00000000007fd398 in ServerLoop () at postmaster.c:1753
#22 0x00000000007fc92f in PostmasterMain (argc=3, argv=0x244d6e0) at postmaster.c:1361
#23 0x0000000000734f08 in main (argc=3, argv=0x244d6e0) at main.c:228

knizhnik(at)knizhnik:~$ gdb -p 1558
#0 0x00007f2c6edd19b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1 0x000000000087185a in WaitEventSetWaitBlock (set=0x21d8ec8, cur_timeout=-1, occurred_events=0x7ffe17ed0c20, nevents=1) at latch.c:1048
#2 0x0000000000871716 in WaitEventSetWait (set=0x21d8ec8, timeout=-1, occurred_events=0x7ffe17ed0c20, nevents=1, wait_event_info=100663299) at latch.c:1000
#3 0x0000000000870e6c in WaitLatchOrSocket (latch=0x7f2c6d2825a4, wakeEvents=19, sock=4, timeout=-1, wait_event_info=100663299) at latch.c:385
#4 0x00007f2c1a5bc304 in libpqrcv_PQexec (streamConn=0x21e0ca0, query=0x21d9a28 "CREATE_REPLICATION_SLOT \"node_2_16592_sync_16696\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT")
at libpqwalreceiver.c:585
#5 0x00007f2c1a5bc921 in libpqrcv_create_slot (conn=0x217f770, slotname=0x2182e38 "node_2_16592_sync_16696", temporary=1 '\001', snapshot_action=CRS_USE_SNAPSHOT, lsn=0x7ffe17ed0d80)
at libpqwalreceiver.c:796
#6 0x0000000000832088 in LogicalRepSyncTableStart (origin_startpos=0x7ffe17ed0d80) at tablesync.c:917
#7 0x0000000000834d18 in ApplyWorkerMain (main_arg=2) at worker.c:1582
#8 0x00000000007eeab1 in StartBackgroundWorker () at bgworker.c:835
#9 0x000000000080267a in do_start_bgworker (rw=0x217b610) at postmaster.c:5680
#10 0x00000000008029c2 in maybe_start_bgworkers () at postmaster.c:5884
#11 0x00000000008019cc in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5073
#12 <signal handler called>
#13 0x00007f2c6edc7573 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#14 0x00000000007fd23c in ServerLoop () at postmaster.c:1717
#15 0x00000000007fc92f in PostmasterMain (argc=3, argv=0x21516e0) at postmaster.c:1361
#16 0x0000000000734f08 in main (argc=3, argv=0x21516e0) at main.c:228

knizhnik(at)knizhnik:~$ gdb -p 1510
#0 0x00007f7b991569b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1 0x000000000087185a in WaitEventSetWaitBlock (set=0x2479f40, cur_timeout=-1, occurred_events=0x7ffff3a42fb0, nevents=1) at latch.c:1048
#2 0x0000000000871716 in WaitEventSetWait (set=0x2479f40, timeout=-1, occurred_events=0x7ffff3a42fb0, nevents=1, wait_event_info=50331655) at latch.c:1000
#3 0x0000000000870e6c in WaitLatchOrSocket (latch=0x7f7b975dd844, wakeEvents=1, sock=-1, timeout=-1, wait_event_info=50331655) at latch.c:385
#4 0x0000000000870d3e in WaitLatch (latch=0x7f7b975dd844, wakeEvents=1, timeout=0, wait_event_info=50331655) at latch.c:339
#5 0x000000000088995c in ProcSleep (locallock=0x2476470, lockMethodTable=0xc17420 <default_lockmethod>) at proc.c:1255
#6 0x00000000008828cb in WaitOnLock (locallock=0x2476470, owner=0x2517a48) at lock.c:1702
#7 0x000000000088157e in LockAcquireExtended (locktag=0x7ffff3a433c0, lockmode=8, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1 '\001') at lock.c:998
#8 0x0000000000880ba8 in LockAcquire (locktag=0x7ffff3a433c0, lockmode=8, sessionLock=0 '\000', dontWait=0 '\000') at lock.c:688
#9 0x0000000000880087 in LockSharedObject (classid=6100, objid=16589, objsubid=0, lockmode=8) at lmgr.c:882
#10 0x000000000067c361 in AlterSubscription (stmt=0x24dd558) at subscriptioncmds.c:645
#11 0x00000000008aa7ab in ProcessUtilitySlow (pstate=0x2479dd0, pstmt=0x24dd998, queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION",
context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, completionTag=0x7ffff3a43c90 "") at utility.c:1629
#12 0x00000000008a90d0 in standard_ProcessUtility (pstmt=0x24dd998, queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION",
context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, completionTag=0x7ffff3a43c90 "") at utility.c:928
#13 0x00007f7b98528884 in pathman_process_utility_hook (first_arg=0x24dd998, queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION",
context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, completionTag=0x7ffff3a43c90 "") at src/hooks.c:913
#14 0x00000000008a80a2 in ProcessUtility (pstmt=0x24dd998, queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION",
context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, completionTag=0x7ffff3a43c90 "") at utility.c:353
#15 0x00000000008a7075 in PortalRunUtility (portal=0x246be60, pstmt=0x24dd998, isTopLevel=0 '\000', setHoldSnapshot=0 '\000', dest=0x24ddcb8, completionTag=0x7ffff3a43c90 "") at pquery.c:1178
#16 0x00000000008a728d in PortalRunMulti (portal=0x246be60, isTopLevel=0 '\000', setHoldSnapshot=0 '\000', dest=0x24ddcb8, altdest=0x24ddcb8, completionTag=0x7ffff3a43c90 "") at pquery.c:1324
#17 0x00000000008a6757 in PortalRun (portal=0x246be60, count=9223372036854775807, isTopLevel=0 '\000', run_once=1 '\001', dest=0x24ddcb8, altdest=0x24ddcb8, completionTag=0x7ffff3a43c90 "")
at pquery.c:799
#18 0x00000000008a0288 in exec_simple_query (query_string=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION") at postgres.c:1099
#19 0x00000000008a4823 in PostgresMain (argc=1, argv=0x247c2d0, dbname=0x247c2a8 "postgres", username=0x244f870 "knizhnik") at postgres.c:4090
#20 0x0000000000801753 in BackendRun (port=0x2477770) at postmaster.c:4357
#21 0x0000000000800e5f in BackendStartup (port=0x2477770) at postmaster.c:4029
#22 0x00000000007fd398 in ServerLoop () at postmaster.c:1753
#23 0x00000000007fc92f in PostmasterMain (argc=3, argv=0x244d6e0) at postmaster.c:1361
#24 0x0000000000734f08 in main (argc=3, argv=0x244d6e0) at main.c:228

The reason of this deadlock seems to be clear: ALTER SUBSCRIPTION starts transaction at one node and tries to create slot at other node, which waiting for completion of all active transaction while building scnapshpot.
Is there any way to avoid this deadlock?

The same deadlock can happen in concurrent CREATE SUBSCRIPTION with implicit slot creation. But here it can be sovled by explicit slot creation prior to CREATE SUBSCRIPTION.
But what can I do if I want to update publications and refresh subscriptions?

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2017-10-24 18:14:53 unique index violation after pg_upgrade to PG10
Previous Message Alvaro Herrera 2017-10-24 16:17:15 Re: performance drop after upgrade (9.6 > 10)