Re: "ERROR: deadlock detected" when replicating TRUNCATE

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: "ERROR: deadlock detected" when replicating TRUNCATE
Date: 2021-05-17 08:47:14
Message-ID: CAA4eK1LKWUL_GdLew3-M_kFw8aCS91PrZqbdHfzaG2T4tLZ3AQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 17, 2021 at 12:30 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> While doing logical replication testing we encountered a problem which
> causes a deadlock error to be logged when replicating a TRUNCATE
> simultaneously to 2 Subscriptions.
> e.g.
> ----------
> 2021-05-12 11:30:19.457 AEST [11393] ERROR: deadlock detected
> 2021-05-12 11:30:19.457 AEST [11393] DETAIL: Process 11393 waits for
> ShareLock on transaction 597; blocked by process 11582.
> Process 11582 waits for ShareLock on relation 16384 of database 14896;
> blocked by process 11393.
> ----------
>
> At this time, both the subscriber (apply worker) processes are
> executing inside the ExecuteTruncateGuts function simultaneously and
> they become co-dependent.
>
> e.g.
> ----------
> Process 11582
> (gdb) bt
> #0 0x00007fa1979515e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1 0x000000000093e5d0 in WaitEventSetWaitBlock (set=0x2facac8,
> cur_timeout=-1, occurred_events=0x7ffed5fdff00, nevents=1) at
> latch.c:1450
> #2 0x000000000093e468 in WaitEventSetWait (set=0x2facac8, timeout=-1,
> occurred_events=0x7ffed5fdff00, nevents=1, wait_event_info=50331648)
> at latch.c:1396
> #3 0x000000000093d8cd in WaitLatch (latch=0x7fa191042654,
> wakeEvents=33, timeout=0, wait_event_info=50331648) at latch.c:473
> #4 0x00000000009660f0 in ProcSleep (locallock=0x2fd06d8,
> lockMethodTable=0xcd90a0 <default_lockmethod>) at proc.c:1361
..
> ----------
> Process 11393
> (gdb) bt
> #0 0x00007fa197917f90 in __nanosleep_nocancel () from /lib64/libc.so.6
> #1 0x00007fa197917e44 in sleep () from /lib64/libc.so.6
> #2 0x0000000000950f84 in DeadLockReport () at deadlock.c:1151
> #3 0x0000000000955013 in WaitOnLock (locallock=0x2fd05d0,
> owner=0x2fd9a48) at lock.c:1873
>
..
> ----------
>
> The essence of the trouble seems to be that the apply_handle_truncate
> function never anticipated it may end up truncating the same table
> from 2 separate workers (subscriptions) like this test case is doing.
> Probably this is quite an old problem because the
> apply_handle_truncate code has not changed much for a long time.
>

Yeah, have you checked it in the back branches?

I am also able to reproduce this and have analyzed the cause of the
above error. Process 11393 waits while updating the pg_class tuple via
RelationSetNewRelfilenode(), because that tuple has already been
updated by process 11582 (in transaction 597, which is not yet
committed). Meanwhile, process 11582 waits to acquire ShareLock on
relation 16384, which process 11393 already holds in RowExclusive mode
from apply_handle_truncate. So each process waits on the other, which
causes a deadlock.
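
The wait cycle here has the classic two-resource shape: each process
holds one lock and waits for the one the other holds. A generic
two-session sketch of the same shape (table `t` and the row ids are
illustrative, not taken from the report):

```sql
-- session 1:
BEGIN;
UPDATE t SET v = 0 WHERE id = 1;   -- locks row 1
-- session 2:
BEGIN;
UPDATE t SET v = 0 WHERE id = 2;   -- locks row 2
UPDATE t SET v = 0 WHERE id = 1;   -- blocks, waiting on session 1
-- session 1:
UPDATE t SET v = 0 WHERE id = 2;   -- blocks on session 2: cycle complete
-- after deadlock_timeout, the deadlock checker aborts one session with
-- "ERROR: deadlock detected", just as in the log above
```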

>
> PSA a patch adding a test for this scenario.
>

+
+$node_publisher->safe_psql('postgres',
+ "ALTER SYSTEM SET synchronous_standby_names TO 'any 2(sub5_1, sub5_2)'");
+$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");

Do you really need these steps to reproduce the problem? IIUC, this
has nothing to do with synchronous replication.
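
For reference, the scenario itself seems to need only two
subscriptions replicating the same table, so that two apply workers
replay the TRUNCATE concurrently. A minimal sketch (object names and
connection strings are illustrative, not from the actual patch):

```sql
-- on the publisher:
CREATE TABLE tab (a int);
CREATE PUBLICATION pub1 FOR TABLE tab;
CREATE PUBLICATION pub2 FOR TABLE tab;

-- on the subscriber (two subscriptions on the same node, same table):
CREATE TABLE tab (a int);
CREATE SUBSCRIPTION sub1 CONNECTION '...' PUBLICATION pub1;
CREATE SUBSCRIPTION sub2 CONNECTION '...' PUBLICATION pub2;

-- back on the publisher: both apply workers replay this concurrently
-- and can deadlock inside ExecuteTruncateGuts as described above
TRUNCATE tab;
```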

--
With Regards,
Amit Kapila.
