Re: Deadlock between backend and recovery may not be detected

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Deadlock between backend and recovery may not be detected
Date: 2020-12-22 11:41:02
Message-ID: 3bb4666a-1e8b-dd42-6483-db2b8b81f804@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020/12/19 1:43, Drouvot, Bertrand wrote:
> Hi,
>
> On 12/18/20 10:35 AM, Fujii Masao wrote:
>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>>
>>
>>
>> On 2020/12/17 2:15, Fujii Masao wrote:
>>>
>>>
>>> On 2020/12/16 23:28, Drouvot, Bertrand wrote:
>>>> Hi,
>>>>
>>>> On 12/16/20 2:36 PM, Victor Yegorov wrote:
>>>>>
>>>>> *CAUTION*: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>>>>>
>>>>>
>>>>> ср, 16 дек. 2020 г. в 13:49, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com <mailto:masao(dot)fujii(at)oss(dot)nttdata(dot)com>>:
>>>>>
>>>>>     After doing this procedure, you can see the startup process and backend
>>>>>     wait for the table lock each other, i.e., deadlock. But this deadlock remains
>>>>>     even after deadlock_timeout passes.
>>>>>
>>>>>     This seems a bug to me.
>>>>>
>>>> +1
>>>>
>>>>>
>>>>>     > * Deadlocks involving the Startup process and an ordinary backend process
>>>>>     > * will be detected by the deadlock detector within the ordinary backend.
>>>>>
>>>>>     The cause of this issue seems that ResolveRecoveryConflictWithLock() that
>>>>>     the startup process calls when recovery conflict on lock happens doesn't
>>>>>     take care of deadlock case at all. You can see this fact by reading the above
>>>>>     source code comment for ResolveRecoveryConflictWithLock().
>>>>>
>>>>>     To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
>>>>>     timer in ResolveRecoveryConflictWithLock() so that the startup process can
>>>>>     send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
>>>>>     Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
>>>>>     the backend should check whether the deadlock actually happens or not.
>>>>>     Attached is the POC patch implimenting this.
>>>>>
>>>> good catch!
>>>>
>>>> I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already set in ResolveRecoveryConflictWithBufferPin()).
>>>>
>>>> So + 1 to consider this as a bug and for the way the patch proposes to fix it.
>>>
>>> Thanks Victor and Bertrand for agreeing!
>>> Attached is the updated version of the patch.
>>
>> Attached is v3 of the patch. Could you review this version?
>>
>> While the startup process is waiting for recovery conflict on buffer pin,
>> it repeats sending the request for deadlock check to all the backends
>> every deadlock_timeout. This may increase the workload in the startup
>> process and backends, but since this is the original behavior, the patch
>> doesn't change that.
>
> Agree.
>
> IMHO that may need to be rethink (as we are already in a conflict situation, i am tempted to say that the less is being done the better it is), but i think that's outside the scope of this patch.

Yes, I agree that's better. I think that we should improve that as a separate
patch only for master branch, after fixing the bug and back-patching that
at first.

>
>> Also in practice this may not be so harmful because
>> the period that the buffer is kept pinned is basically not so long.
>
> Agree that chances are less to be in this mode for a "long" duration (as compare to the lock conflict).
>
>>
>> On the other hand, IMO we should avoid this issue while the startup
>> process is waiting for recovery conflict on locks since it can take
>> a long time to release the locks. So the patch tries to fix it.
> Agree
>>
>> Or I'm overthinking this? If this doesn't need to be handled,
>> the patch can be simplified more. Thought?
>
> I do think that's good to handle it that way for the lock conflict: the less work is done the better it is (specially in a conflict situation).
>
> The patch does look good to me.

Thanks for the review!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2020-12-22 11:42:00 Re: Deadlock between backend and recovery may not be detected
Previous Message Peter Smith 2020-12-22 11:28:21 Re: Single transaction in the tablesync worker?