Re: Deadlock between backend and recovery may not be detected

From: "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Deadlock between backend and recovery may not be detected
Date: 2020-12-18 16:43:12
Message-ID: 2626254b-0578-875c-bb62-34e0accdb5ba@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 12/18/20 10:35 AM, Fujii Masao wrote:
> CAUTION: This email originated from outside of the organization. Do
> not click links or open attachments unless you can confirm the sender
> and know the content is safe.
>
>
>
> On 2020/12/17 2:15, Fujii Masao wrote:
>>
>>
>> On 2020/12/16 23:28, Drouvot, Bertrand wrote:
>>> Hi,
>>>
>>> On 12/16/20 2:36 PM, Victor Yegorov wrote:
>>>>
>>>> *CAUTION*: This email originated from outside of the organization.
>>>> Do not click links or open attachments unless you can confirm the
>>>> sender and know the content is safe.
>>>>
>>>>
>>>> ср, 16 дек. 2020 г. в 13:49, Fujii Masao
>>>> <masao(dot)fujii(at)oss(dot)nttdata(dot)com <mailto:masao(dot)fujii(at)oss(dot)nttdata(dot)com>>:
>>>>
>>>>     After doing this procedure, you can see the startup process and
>>>> backend
>>>>     wait for the table lock each other, i.e., deadlock. But this
>>>> deadlock remains
>>>>     even after deadlock_timeout passes.
>>>>
>>>>     This seems a bug to me.
>>>>
>>> +1
>>>
>>>>
>>>>     > * Deadlocks involving the Startup process and an ordinary
>>>> backend process
>>>>     > * will be detected by the deadlock detector within the
>>>> ordinary backend.
>>>>
>>>>     The cause of this issue seems that
>>>> ResolveRecoveryConflictWithLock() that
>>>>     the startup process calls when recovery conflict on lock
>>>> happens doesn't
>>>>     take care of deadlock case at all. You can see this fact by
>>>> reading the above
>>>>     source code comment for ResolveRecoveryConflictWithLock().
>>>>
>>>>     To fix this issue, I think that we should enable
>>>> STANDBY_DEADLOCK_TIMEOUT
>>>>     timer in ResolveRecoveryConflictWithLock() so that the startup
>>>> process can
>>>>     send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the
>>>> backend.
>>>>     Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
>>>>     the backend should check whether the deadlock actually happens
>>>> or not.
>>>>     Attached is the POC patch implimenting this.
>>>>
>>> good catch!
>>>
>>> I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT
>>> shouldn't be set in ResolveRecoveryConflictWithLock() too (it is
>>> already set in ResolveRecoveryConflictWithBufferPin()).
>>>
>>> So + 1 to consider this as a bug and for the way the patch proposes
>>> to fix it.
>>
>> Thanks Victor and Bertrand for agreeing!
>> Attached is the updated version of the patch.
>
> Attached is v3 of the patch. Could you review this version?
>
> While the startup process is waiting for recovery conflict on buffer pin,
> it repeats sending the request for deadlock check to all the backends
> every deadlock_timeout. This may increase the workload in the startup
> process and backends, but since this is the original behavior, the patch
> doesn't change that.

Agree.

IMHO that may need to be rethink (as we are already in a conflict
situation, i am tempted to say that the less is being done the better it
is), but i think that's outside the scope of this patch.

> Also in practice this may not be so harmful because
> the period that the buffer is kept pinned is basically not so long.

Agree that chances are less to be in this mode for a "long" duration (as
compare to the lock conflict).

>
> On the other hand, IMO we should avoid this issue while the startup
> process is waiting for recovery conflict on locks since it can take
> a long time to release the locks. So the patch tries to fix it.
Agree
>
> Or I'm overthinking this? If this doesn't need to be handled,
> the patch can be simplified more. Thought?

I do think that's good to handle it that way for the lock conflict: the
less work is done the better it is (specially in a conflict situation).

The patch does look good to me.

Bertrand

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-12-18 16:49:02 initdb fails under bleeding-edge valgrind
Previous Message Stephen Frost 2020-12-18 16:30:16 Re: Refactor routine to check for ASCII-only case