Re: Deadlock between backend and recovery may not be detected

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Deadlock between backend and recovery may not be detected
Date: 2020-12-18 09:35:50
Message-ID: 82f60836-94e7-40ce-76a5-c2828df9cfbe@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020/12/17 2:15, Fujii Masao wrote:
>
>
> On 2020/12/16 23:28, Drouvot, Bertrand wrote:
>> Hi,
>>
>> On 12/16/20 2:36 PM, Victor Yegorov wrote:
>>>
>>> *CAUTION*: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>>>
>>>
>>> ср, 16 дек. 2020 г. в 13:49, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com <mailto:masao(dot)fujii(at)oss(dot)nttdata(dot)com>>:
>>>
>>>     After doing this procedure, you can see the startup process and backend
>>>     wait for the table lock each other, i.e., deadlock. But this deadlock remains
>>>     even after deadlock_timeout passes.
>>>
>>>     This seems a bug to me.
>>>
>> +1
>>
>>>
>>>     > * Deadlocks involving the Startup process and an ordinary backend process
>>>     > * will be detected by the deadlock detector within the ordinary backend.
>>>
>>>     The cause of this issue seems that ResolveRecoveryConflictWithLock() that
>>>     the startup process calls when recovery conflict on lock happens doesn't
>>>     take care of deadlock case at all. You can see this fact by reading the above
>>>     source code comment for ResolveRecoveryConflictWithLock().
>>>
>>>     To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
>>>     timer in ResolveRecoveryConflictWithLock() so that the startup process can
>>>     send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
>>>     Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
>>>     the backend should check whether the deadlock actually happens or not.
>>>     Attached is the POC patch implimenting this.
>>>
>> good catch!
>>
>> I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already set in ResolveRecoveryConflictWithBufferPin()).
>>
>> So + 1 to consider this as a bug and for the way the patch proposes to fix it.
>
> Thanks Victor and Bertrand for agreeing!
> Attached is the updated version of the patch.

Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that. Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.

On the other hand, IMO we should avoid this issue while the startup
process is waiting for recovery conflict on locks since it can take
a long time to release the locks. So the patch tries to fix it.

Or I'm overthinking this? If this doesn't need to be handled,
the patch can be simplified more. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment Content-Type Size
recovery_conflict_lock_deadlock_v3.patch text/plain 9.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2020-12-18 09:51:55 Re: Incorrect allocation handling for cryptohash functions with OpenSSL
Previous Message Heikki Linnakangas 2020-12-18 09:35:14 Re: Incorrect allocation handling for cryptohash functions with OpenSSL