Re: Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION
Date: 2017-10-31 12:24:22
Message-ID: CAD21AoC=ETDqenL4A0y6XrqjckWCq145EFDUi2UbBVL_RarBjA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 24, 2017 at 7:13 PM, Konstantin Knizhnik
<k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
> Parallel execution of ALTER SUBSCRIPTION REFRESH PUBLICATION at several
> nodes may cause deadlock:
>
> knizhnik 1480 0.0 0.1 1417532 16496 ? Ss 20:01 0:00 postgres:
> bgworker: logical replication worker for subscription 16589 sync 16720
> waiting
> knizhnik 1481 0.0 0.1 1417668 17668 ? Ss 20:01 0:00 postgres:
> knizhnik postgres 127.0.0.1(36378) idle
> knizhnik 1484 0.0 0.1 1406952 16676 ? Ss 20:01 0:00 postgres:
> knizhnik postgres 127.0.0.1(56194) ALTER SUBSCRIPTION waiting
> knizhnik 1486 0.0 0.1 1410040 21268 ? Ss 20:01 0:00 postgres:
> wal sender process knizhnik 127.0.0.1(36386) idle
> knizhnik 1487 0.0 0.1 1417532 16736 ? Ss 20:01 0:00 postgres:
> bgworker: logical replication worker for subscription 16589 sync 16774
> knizhnik 1489 0.0 0.1 1410040 21336 ? Ss 20:01 0:00 postgres:
> wal sender process knizhnik 127.0.0.1(36388) idle
> knizhnik 1510 0.0 0.1 1406952 16820 ? Ss 20:01 0:00 postgres:
> knizhnik postgres 127.0.0.1(56228) ALTER SUBSCRIPTION waiting
> knizhnik 1530 0.0 0.1 1406952 16736 ? Ss 20:01 0:00 postgres:
> knizhnik postgres 127.0.0.1(56260) ALTER SUBSCRIPTION waiting
> knizhnik 1534 0.0 0.0 1417504 14360 ? Ss 20:01 0:00 postgres:
> bgworker: logical replication worker for subscription 16590
> knizhnik 1537 0.0 0.1 1409164 16920 ? Ss 20:01 0:00 postgres:
> wal sender process knizhnik 127.0.0.1(45054) idle
> knizhnik 1545 0.0 0.0 1417344 14576 ? Ss 20:01 0:00 postgres:
> bgworker: logical replication worker for subscription 16592
> knizhnik 1546 0.0 0.1 1409168 16588 ? Ss 20:01 0:00 postgres:
> wal sender process knizhnik 127.0.0.1(56274) idle
> knizhnik 1547 0.0 0.0 1417348 14332 ? Ss 20:01 0:00 postgres:
> bgworker: logical replication worker for subscription 16592 sync 16606
> knizhnik 1549 0.0 0.0 1409004 14512 ? Ss 20:01 0:00 postgres:
> wal sender process knizhnik 127.0.0.1(56276) idle in transaction waiting
> knizhnik 1553 0.0 0.0 1417348 14540 ? Ss 20:01 0:00 postgres:
> bgworker: logical replication worker for subscription 16592
> knizhnik 1554 0.0 0.1 1409168 16596 ? Ss 20:01 0:00 postgres:
> wal sender process knizhnik 127.0.0.1(56280) idle
> knizhnik 1555 0.0 0.0 1417348 14332 ? Ss 20:01 0:00 postgres:
> bgworker: logical replication worker for subscription 16592 sync 16660
> knizhnik 1556 0.0 0.0 1409004 14520 ? Ss 20:01 0:00 postgres:
> wal sender process knizhnik 127.0.0.1(56282) idle in transaction waiting
> knizhnik 1558 0.0 0.0 1417348 14460 ? Ss 20:01 0:00 postgres:
> bgworker: logical replication worker for subscription 16592 sync 16696
> knizhnik 1560 0.0 0.0 1409004 14516 ? Ss 20:01 0:00 postgres:
> wal sender process knizhnik 127.0.0.1(56286) idle in transaction waiting
> knizhnik 1562 0.0 0.0 1417348 14084 ? Ss 20:01 0:00 postgres:
> bgworker: logical replication worker for subscription 16592 sync 16732
> knizhnik 1567 0.0 0.0 1409004 14516 ? Ss 20:01 0:00 postgres:
> wal sender process knizhnik 127.0.0.1(56288) idle in transaction waiting
>
> knizhnik(at)knizhnik:~$ gdb -p 1556
> (gdb) bt
> #0 0x00007f7b991569b3 in __epoll_wait_nocancel () at
> ../sysdeps/unix/syscall-template.S:84
> #1 0x000000000087185a in WaitEventSetWaitBlock (set=0x246d1e8,
> cur_timeout=-1, occurred_events=0x7ffff3a43580, nevents=1) at latch.c:1048
> #2 0x0000000000871716 in WaitEventSetWait (set=0x246d1e8, timeout=-1,
> occurred_events=0x7ffff3a43580, nevents=1, wait_event_info=50331652) at
> latch.c:1000
> #3 0x0000000000870e6c in WaitLatchOrSocket (latch=0x7f7b975daba4,
> wakeEvents=1, sock=-1, timeout=-1, wait_event_info=50331652) at latch.c:385
> #4 0x0000000000870d3e in WaitLatch (latch=0x7f7b975daba4, wakeEvents=1,
> timeout=0, wait_event_info=50331652) at latch.c:339
> #5 0x000000000088995c in ProcSleep (locallock=0x24764d0,
> lockMethodTable=0xc17420 <default_lockmethod>) at proc.c:1255
> #6 0x00000000008828cb in WaitOnLock (locallock=0x24764d0, owner=0x2517c78)
> at lock.c:1702
> #7 0x000000000088157e in LockAcquireExtended (locktag=0x7ffff3a439a0,
> lockmode=5, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1
> '\001') at lock.c:998
> #8 0x0000000000880ba8 in LockAcquire (locktag=0x7ffff3a439a0, lockmode=5,
> sessionLock=0 '\000', dontWait=0 '\000') at lock.c:688
> #9 0x000000000087f8dc in XactLockTableWait (xid=624, rel=0x0, ctid=0x0,
> oper=XLTW_None) at lmgr.c:587
> #10 0x000000000082f524 in SnapBuildWaitSnapshot (running=0x246d1b0,
> cutoff=632) at snapbuild.c:1400
> #11 0x000000000082f2f6 in SnapBuildFindSnapshot (builder=0x254b550,
> lsn=24410008, running=0x246d1b0) at snapbuild.c:1311
> #12 0x000000000082ee67 in SnapBuildProcessRunningXacts (builder=0x254b550,
> lsn=24410008, running=0x246d1b0) at snapbuild.c:1106
> #13 0x000000000081c458 in DecodeStandbyOp (ctx=0x25334b0,
> buf=0x7ffff3a43b10) at decode.c:314
> #14 0x000000000081bf39 in LogicalDecodingProcessRecord (ctx=0x25334b0,
> record=0x2533770) at decode.c:117
> #15 0x0000000000820219 in DecodingContextFindStartpoint (ctx=0x25334b0) at
> logical.c:458
> #16 0x000000000083674a in CreateReplicationSlot (cmd=0x246c418) at
> walsender.c:934
> #17 0x00000000008374ef in exec_replication_command (cmd_string=0x24dd140
> "CREATE_REPLICATION_SLOT \"node_1_16592_sync_16660\" TEMPORARY LOGICAL
> pgoutput USE_SNAPSHOT") at walsender.c:1511
> #18 0x00000000008a47ff in PostgresMain (argc=1, argv=0x247c468,
> dbname=0x247c2a8 "postgres", username=0x244f870 "knizhnik") at
> postgres.c:4086
> #19 0x0000000000801753 in BackendRun (port=0x2477db0) at postmaster.c:4357
> #20 0x0000000000800e5f in BackendStartup (port=0x2477db0) at
> postmaster.c:4029
> #21 0x00000000007fd398 in ServerLoop () at postmaster.c:1753
> #22 0x00000000007fc92f in PostmasterMain (argc=3, argv=0x244d6e0) at
> postmaster.c:1361
> #23 0x0000000000734f08 in main (argc=3, argv=0x244d6e0) at main.c:228
>
> knizhnik(at)knizhnik:~$ gdb -p 1558
> #0 0x00007f2c6edd19b3 in __epoll_wait_nocancel () at
> ../sysdeps/unix/syscall-template.S:84
> #1 0x000000000087185a in WaitEventSetWaitBlock (set=0x21d8ec8,
> cur_timeout=-1, occurred_events=0x7ffe17ed0c20, nevents=1) at latch.c:1048
> #2 0x0000000000871716 in WaitEventSetWait (set=0x21d8ec8, timeout=-1,
> occurred_events=0x7ffe17ed0c20, nevents=1, wait_event_info=100663299) at
> latch.c:1000
> #3 0x0000000000870e6c in WaitLatchOrSocket (latch=0x7f2c6d2825a4,
> wakeEvents=19, sock=4, timeout=-1, wait_event_info=100663299) at latch.c:385
> #4 0x00007f2c1a5bc304 in libpqrcv_PQexec (streamConn=0x21e0ca0,
> query=0x21d9a28 "CREATE_REPLICATION_SLOT \"node_2_16592_sync_16696\"
> TEMPORARY LOGICAL pgoutput USE_SNAPSHOT")
> at libpqwalreceiver.c:585
> #5 0x00007f2c1a5bc921 in libpqrcv_create_slot (conn=0x217f770,
> slotname=0x2182e38 "node_2_16592_sync_16696", temporary=1 '\001',
> snapshot_action=CRS_USE_SNAPSHOT, lsn=0x7ffe17ed0d80)
> at libpqwalreceiver.c:796
> #6 0x0000000000832088 in LogicalRepSyncTableStart
> (origin_startpos=0x7ffe17ed0d80) at tablesync.c:917
> #7 0x0000000000834d18 in ApplyWorkerMain (main_arg=2) at worker.c:1582
> #8 0x00000000007eeab1 in StartBackgroundWorker () at bgworker.c:835
> #9 0x000000000080267a in do_start_bgworker (rw=0x217b610) at
> postmaster.c:5680
> #10 0x00000000008029c2 in maybe_start_bgworkers () at postmaster.c:5884
> #11 0x00000000008019cc in sigusr1_handler (postgres_signal_arg=10) at
> postmaster.c:5073
> #12 <signal handler called>
> #13 0x00007f2c6edc7573 in __select_nocancel () at
> ../sysdeps/unix/syscall-template.S:84
> #14 0x00000000007fd23c in ServerLoop () at postmaster.c:1717
> #15 0x00000000007fc92f in PostmasterMain (argc=3, argv=0x21516e0) at
> postmaster.c:1361
> #16 0x0000000000734f08 in main (argc=3, argv=0x21516e0) at main.c:228
>
> knizhnik(at)knizhnik:~$ gdb -p 1510
> #0 0x00007f7b991569b3 in __epoll_wait_nocancel () at
> ../sysdeps/unix/syscall-template.S:84
> #1 0x000000000087185a in WaitEventSetWaitBlock (set=0x2479f40,
> cur_timeout=-1, occurred_events=0x7ffff3a42fb0, nevents=1) at latch.c:1048
> #2 0x0000000000871716 in WaitEventSetWait (set=0x2479f40, timeout=-1,
> occurred_events=0x7ffff3a42fb0, nevents=1, wait_event_info=50331655) at
> latch.c:1000
> #3 0x0000000000870e6c in WaitLatchOrSocket (latch=0x7f7b975dd844,
> wakeEvents=1, sock=-1, timeout=-1, wait_event_info=50331655) at latch.c:385
> #4 0x0000000000870d3e in WaitLatch (latch=0x7f7b975dd844, wakeEvents=1,
> timeout=0, wait_event_info=50331655) at latch.c:339
> #5 0x000000000088995c in ProcSleep (locallock=0x2476470,
> lockMethodTable=0xc17420 <default_lockmethod>) at proc.c:1255
> #6 0x00000000008828cb in WaitOnLock (locallock=0x2476470, owner=0x2517a48)
> at lock.c:1702
> #7 0x000000000088157e in LockAcquireExtended (locktag=0x7ffff3a433c0,
> lockmode=8, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1
> '\001') at lock.c:998
> #8 0x0000000000880ba8 in LockAcquire (locktag=0x7ffff3a433c0, lockmode=8,
> sessionLock=0 '\000', dontWait=0 '\000') at lock.c:688
> #9 0x0000000000880087 in LockSharedObject (classid=6100, objid=16589,
> objsubid=0, lockmode=8) at lmgr.c:882
> #10 0x000000000067c361 in AlterSubscription (stmt=0x24dd558) at
> subscriptioncmds.c:645
> #11 0x00000000008aa7ab in ProcessUtilitySlow (pstate=0x2479dd0,
> pstmt=0x24dd998, queryString=0x24dc920 "SET SESSION synchronous_commit TO
> local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION",
> context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8,
> completionTag=0x7ffff3a43c90 "") at utility.c:1629
> #12 0x00000000008a90d0 in standard_ProcessUtility (pstmt=0x24dd998,
> queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER
> SUBSCRIPTION sub_3_1 REFRESH PUBLICATION",
> context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8,
> completionTag=0x7ffff3a43c90 "") at utility.c:928
> #13 0x00007f7b98528884 in pathman_process_utility_hook (first_arg=0x24dd998,
> queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER
> SUBSCRIPTION sub_3_1 REFRESH PUBLICATION",
> context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8,
> completionTag=0x7ffff3a43c90 "") at src/hooks.c:913
> #14 0x00000000008a80a2 in ProcessUtility (pstmt=0x24dd998,
> queryString=0x24dc920 "SET SESSION synchronous_commit TO local; ALTER
> SUBSCRIPTION sub_3_1 REFRESH PUBLICATION",
> context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8,
> completionTag=0x7ffff3a43c90 "") at utility.c:353
> #15 0x00000000008a7075 in PortalRunUtility (portal=0x246be60,
> pstmt=0x24dd998, isTopLevel=0 '\000', setHoldSnapshot=0 '\000',
> dest=0x24ddcb8, completionTag=0x7ffff3a43c90 "") at pquery.c:1178
> #16 0x00000000008a728d in PortalRunMulti (portal=0x246be60, isTopLevel=0
> '\000', setHoldSnapshot=0 '\000', dest=0x24ddcb8, altdest=0x24ddcb8,
> completionTag=0x7ffff3a43c90 "") at pquery.c:1324
> #17 0x00000000008a6757 in PortalRun (portal=0x246be60,
> count=9223372036854775807, isTopLevel=0 '\000', run_once=1 '\001',
> dest=0x24ddcb8, altdest=0x24ddcb8, completionTag=0x7ffff3a43c90 "")
> at pquery.c:799
> #18 0x00000000008a0288 in exec_simple_query (query_string=0x24dc920 "SET
> SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH
> PUBLICATION") at postgres.c:1099
> #19 0x00000000008a4823 in PostgresMain (argc=1, argv=0x247c2d0,
> dbname=0x247c2a8 "postgres", username=0x244f870 "knizhnik") at
> postgres.c:4090
> #20 0x0000000000801753 in BackendRun (port=0x2477770) at postmaster.c:4357
> #21 0x0000000000800e5f in BackendStartup (port=0x2477770) at
> postmaster.c:4029
> #22 0x00000000007fd398 in ServerLoop () at postmaster.c:1753
> #23 0x00000000007fc92f in PostmasterMain (argc=3, argv=0x244d6e0) at
> postmaster.c:1361
> #24 0x0000000000734f08 in main (argc=3, argv=0x244d6e0) at main.c:228
>
>
>
> The reason of this deadlock seems to be clear: ALTER SUBSCRIPTION starts
> transaction at one node and tries to create slot at other node, which
> waiting for completion of all active transaction while building scnapshpot.
> Is there any way to avoid this deadlock?

I think that there is no way to avoid this deadlock except for
refreshing subscriptions sequentially. One solution I came up with is
to change ALTER TABLE REFRESH so that we can specify replication slot
names used for table sync, or can accord the ERRCODE_DUPLICATE_OBJECT
error when creating a replication slot. These are obviously poor
solutions, though.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2017-10-31 12:27:02 Re: Remove inbound links to sql-createuser
Previous Message Vitaly Burovoy 2017-10-31 12:08:49 Fix dumping pre-10 DBs by pg_dump10 if table "name" exists