Deadlock between backend and recovery may not be detected

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Deadlock between backend and recovery may not be detected
Date: 2020-12-16 12:49:07
Message-ID: 4041d6b6-cf24-a120-36fa-1294220f8243@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

While reviewing the patch proposed at [1], I found that there is the case
where deadlock that recovery conflict on lock is involved in may not be
detected. This deadlock can happen between backends and the startup
process, in the standby server. Please see the following procedure to
reproduce the deadlock.

#1. Set up streaming replication.

#2. Set max_standby_streaming_delay to -1 in the standby.

#3. Create two tables in the primary.

[PRIMARY: SESSION1]
CREATE TABLE t1 ();
CREATE TABLE t2 ();

#4. Start transaction and access to the table t1.

[STANDBY: SESSION2]
BEGIN;
SELECT * FROM t1;

#5. Start transaction and lock table t2 in access exclusive mode,
in the primary. Also execute pg_switch_wal() to transfer WAL record
for access exclusive lock to the standby.

[PRIMARY: SESSION1]
BEGIN;
LOCK TABLE t2 IN ACCESS EXCLUSIVE MODE;
SELECT pg_switch_wal();

#6. Access to the table t2 within the transaction that started at #4,
in the standby.

[STANDBY: SESSION2]
SELECT * FROM t2;

#7. Lock table t1 in access exclusive mode within the transaction that
started in #5, in the primary. Also execute pg_switch_wal() to transfer
WAL record for access exclusive lock to the standby.

[PRIMARY: SESSION1]
LOCK TABLE t1 IN ACCESS EXCLUSIVE MODE;
SELECT pg_switch_wal();

After doing this procedure, you can see the startup process and backend
wait for the table lock each other, i.e., deadlock. But this deadlock remains
even after deadlock_timeout passes.

This seems a bug to me.

> * Deadlocks involving the Startup process and an ordinary backend process
> * will be detected by the deadlock detector within the ordinary backend.

The cause of this issue seems that ResolveRecoveryConflictWithLock() that
the startup process calls when recovery conflict on lock happens doesn't
take care of deadlock case at all. You can see this fact by reading the above
source code comment for ResolveRecoveryConflictWithLock().

To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
timer in ResolveRecoveryConflictWithLock() so that the startup process can
send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
the backend should check whether the deadlock actually happens or not.
Attached is the POC patch implimenting this.

Thought?

Regards,

[1] https://postgr.es/m/9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment Content-Type Size
recovery_conflict_lock_deadlock_v1.patch text/plain 1.7 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2020-12-16 12:50:27 Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Previous Message Heikki Linnakangas 2020-12-16 12:17:58 Perform COPY FROM encoding conversions in larger chunks