Re: Add Information during standby recovery conflicts

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Add Information during standby recovery conflicts
Date: 2020-12-14 15:20:32
Message-ID: d78ead3f-b7e1-74ab-7b61-6065bbf8f6c1@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020/12/14 21:31, Fujii Masao wrote:
>
>
> On 2020/12/05 12:38, Masahiko Sawada wrote:
>> On Fri, Dec 4, 2020 at 7:22 PM Drouvot, Bertrand <bdrouvot(at)amazon(dot)com> wrote:
>>>
>>> Hi,
>>>
>>> On 12/4/20 2:21 AM, Fujii Masao wrote:
>>>>
>>>> On 2020/12/04 9:28, Masahiko Sawada wrote:
>>>>> On Fri, Dec 4, 2020 at 2:54 AM Fujii Masao
>>>>> <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2020/12/01 17:29, Drouvot, Bertrand wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 12/1/20 12:35 AM, Masahiko Sawada wrote:
>>>>>>>> CAUTION: This email originated from outside of the organization.
>>>>>>>> Do not click links or open attachments unless you can confirm the
>>>>>>>> sender and know the content is safe.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Dec 1, 2020 at 3:25 AM Alvaro Herrera
>>>>>>>> <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>>>>>>>>> On 2020-Dec-01, Fujii Masao wrote:
>>>>>>>>>
>>>>>>>>>> +                     if (proc)
>>>>>>>>>> +                     {
>>>>>>>>>> +                             if (nprocs == 0)
>>>>>>>>>> + appendStringInfo(&buf, "%d", proc->pid);
>>>>>>>>>> +                             else
>>>>>>>>>> + appendStringInfo(&buf, ", %d", proc->pid);
>>>>>>>>>> +
>>>>>>>>>> +                             nprocs++;
>>>>>>>>>>
>>>>>>>>>> What happens if all the backends in wait_list have gone? In
>>>>>>>>>> other words,
>>>>>>>>>> how should we handle the case where nprocs == 0 (i.e., nprocs
>>>>>>>>>> has not been
>>>>>>>>>> incrmented at all)? This would very rarely happen, but can happen.
>>>>>>>>>> In this case, since buf.data is empty, at least there seems no
>>>>>>>>>> need to log
>>>>>>>>>> the list of conflicting processes in detail message.
>>>>>>>>> Yes, I noticed this too; this can be simplified by changing the
>>>>>>>>> condition in the ereport() call to be "nprocs > 0" (rather than
>>>>>>>>> wait_list being null), otherwise not print the errdetail.  (You
>>>>>>>>> could
>>>>>>>>> test buf.data or buf.len instead, but that seems uglier to me.)
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Maybe we can also improve the comment of this function from:
>>>>>>>>
>>>>>>>> + * This function also reports the details about the conflicting
>>>>>>>> + * process ids if *wait_list is not NULL.
>>>>>>>>
>>>>>>>> to " This function also reports the details about the conflicting
>>>>>>>> process ids if exist" or something.
>>>>>>>>
>>>>>>> Thank you all for the review/remarks.
>>>>>>>
>>>>>>> They have been addressed in the new attached patch version.
>>>>>>
>>>>>> Thanks for updating the patch! I read through the patch again
>>>>>> and applied the following chages to it. Attached is the updated
>>>>>> version of the patch. Could you review this version? If there is
>>>>>> no issue in it, I'm thinking to commit this version.
>>>>>
>>>>> Thank you for updating the patch! I have one question.
>>>>>
>>>>>>
>>>>>> +                       timeouts[cnt].id = STANDBY_TIMEOUT;
>>>>>> +                       timeouts[cnt].type = TMPARAM_AFTER;
>>>>>> +                       timeouts[cnt].delay_ms = DeadlockTimeout;
>>>>>>
>>>>>> Maybe STANDBY_TIMEOUT should be STANDBY_DEADLOCK_TIMEOUT here?
>>>>>> I changed the code that way.
>>>>>
>>>>> As the comment of ResolveRecoveryConflictWithLock() says the
>>>>> following, a deadlock is detected by the ordinary backend process:
>>>>>
>>>>>    * Deadlocks involving the Startup process and an ordinary backend
>>>>> proces
>>>>>    * will be detected by the deadlock detector within the ordinary
>>>>> backend.
>>>>>
>>>>> If we use STANDBY_DEADLOCK_TIMEOUT,
>>>>> SendRecoveryConflictWithBufferPin() will be called after
>>>>> DeadlockTimeout passed, but I think it's not necessary for the startup
>>>>> process in this case.
>>>>
>>>> Thanks for pointing this! You are right.
>>>>
>>>>
>>>>> If we want to just wake up the startup process
>>>>> maybe we can use STANDBY_TIMEOUT here?
>>>>
>>> Thanks for the patch updates! Except what we are still discussing below,
>>> it looks good to me.
>>>
>>>> When STANDBY_TIMEOUT happens, a request to release conflicting buffer
>>>> pins is sent. Right? If so, we should not also use STANDBY_TIMEOUT there?
>>>
>>> Agree
>>>
>>>>
>>>> Or, first of all, we don't need to enable the deadlock timer at all?
>>>> Since what we'd like to do is to wake up after deadlock_timeout
>>>> passes, we can do that by changing ProcWaitForSignal() so that it can
>>>> accept the timeout and giving the deadlock_timeout to it. If we do
>>>> this, maybe we can get rid of STANDBY_LOCK_TIMEOUT from
>>>> ResolveRecoveryConflictWithLock(). Thought?
>>
>> Where do we enable deadlock timeout in hot standby case? You meant to
>> enable it in ProcWaitForSignal() or where we set a timer for not hot
>> standby case, in ProcSleep()?
>
> No, what I tried to say is to change ResolveRecoveryConflictWithLock() so that it does
>
> 1. calculate the "minimum" timeout from deadlock_timeout and max_standby_xxx_delay
> 2. give the calculated timeout value to ProcWaitForSignal()
> 3. wait for signal and timeout on ProcWaitForSignal()
>
> Since ProcWaitForSignal() calls WaitLatch(), seems it's not so difficult to make ProcWaitForSignal() handle the timeout. If we do this, I was thinking that we can get rid of enable_timeouts() from ResolveRecoveryConflictWithLock().
>
>
>>
>>>
>>> Why not simply use (again) the STANDBY_LOCK_TIMEOUT one? (as it triggers
>>> a call to StandbyLockTimeoutHandler() which does nothing, except waking
>>> up. That's what we want, right?)
>>
>> Right, what I wanted to mean is STANDBY_LOCK_TIMEOUT. The startup
>> process can wake up and do nothing. Thank you for pointing out.
>
> Okay, understood! Firstly I was thinking that enabling the same type (i.e., STANDBY_LOCK_TIMEOUT) of lock twice doesn't work properly, but as far as I read the code, it works. In that case, only the shorter timeout would be activated in enable_timeouts(). So I agree to use STANDBY_LOCK_TIMEOUT.

So I renamed the argument "deadlock_timer" in ResolveRecoveryConflictWithLock()
because it's not the timer for deadlock and is confusing. Attached is the
updated version of the patch. Barring any objection, I will commit this version.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment Content-Type Size
v11-0002-Log-the-standby-recovery-conflict-waits.patch text/plain 19.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Drouvot, Bertrand 2020-12-14 15:49:16 Re: Add Information during standby recovery conflicts
Previous Message Arne Roland 2020-12-14 15:01:17 Re: a misbehavior of partition row movement (?)