Re: Add Information during standby recovery conflicts

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Add Information during standby recovery conflicts
Date: 2020-12-03 17:53:59
Message-ID: a4d89c01-9084-24f7-f591-bb10080e32d8@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020/12/01 17:29, Drouvot, Bertrand wrote:
> Hi,
>
> On 12/1/20 12:35 AM, Masahiko Sawada wrote:
>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>>
>>
>>
>> On Tue, Dec 1, 2020 at 3:25 AM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>>> On 2020-Dec-01, Fujii Masao wrote:
>>>
>>>> +                     if (proc)
>>>> +                     {
>>>> +                             if (nprocs == 0)
>>>> +                                     appendStringInfo(&buf, "%d", proc->pid);
>>>> +                             else
>>>> +                                     appendStringInfo(&buf, ", %d", proc->pid);
>>>> +
>>>> +                             nprocs++;
>>>>
>>>> What happens if all the backends in wait_list have gone? In other words,
>>>> how should we handle the case where nprocs == 0 (i.e., nprocs has not been
>>>> incrmented at all)? This would very rarely happen, but can happen.
>>>> In this case, since buf.data is empty, at least there seems no need to log
>>>> the list of conflicting processes in detail message.
>>> Yes, I noticed this too; this can be simplified by changing the
>>> condition in the ereport() call to be "nprocs > 0" (rather than
>>> wait_list being null), otherwise not print the errdetail.  (You could
>>> test buf.data or buf.len instead, but that seems uglier to me.)
>> +1
>>
>> Maybe we can also improve the comment of this function from:
>>
>> + * This function also reports the details about the conflicting
>> + * process ids if *wait_list is not NULL.
>>
>> to " This function also reports the details about the conflicting
>> process ids if exist" or something.
>>
> Thank you all for the review/remarks.
>
> They have been addressed in the new attached patch version.

Thanks for updating the patch! I read through the patch again
and applied the following chages to it. Attached is the updated
version of the patch. Could you review this version? If there is
no issue in it, I'm thinking to commit this version.

+ if (waitStart > 0 && !logged_recovery_conflict)
+ {
+ TimestampTz cur_ts = GetCurrentTimestamp();
+ if (TimestampDifferenceExceeds(waitStart, cur_ts,
+ DeadlockTimeout))

On the first time through, this is executed before we have started
actually waiting. Which is a bit wasteful. So I changed LockBufferForCleanup()
and ResolveRecoveryConflictWithVirtualXIDs() so that the code for logging
the recovery conflict is executed after the function to wait is executed.

+ ereport(LOG,
+ errmsg("recovery still waiting after %ld.%03d ms: %s",
+ msecs, usecs, _(get_recovery_conflict_desc(reason))),
+ wait_list > 0 ? errdetail_log_plural("Conflicting process: %s.",
+ "Conflicting processes: %s.",
+ nprocs, buf.data) : 0);

Seems "wait_list > 0" should be "nprocs > 0". So I changed the code that way.

+ if (waitStart > 0)
{
- const char *old_status;

I added "(!logged_recovery_conflict || new_status == NULL)" into
the above if-condition, to avoid executing again the code for logging
after PS title was updated and the recovery conflict was logged.

+ timeouts[cnt].id = STANDBY_TIMEOUT;
+ timeouts[cnt].type = TMPARAM_AFTER;
+ timeouts[cnt].delay_ms = DeadlockTimeout;

Maybe STANDBY_TIMEOUT should be STANDBY_DEADLOCK_TIMEOUT here?
I changed the code that way.

+ /*
+ * Log the recovery conflict if there is still virtual transaction
+ * conflicting with the lock.
+ */
+ if (cnt > 0)
+ {
+ LogRecoveryConflict(PROCSIG_RECOVERY_CONFLICT_LOCK,
+ standbyWaitStart, cur_ts, vxids);
+ logged_recovery_conflict = true;
+ }

I think that ProcSleep() should log the recovery conflict even if
there are no conflicting virtual transactions. Because the startup
process there has already waited longer than deadlock_timeout,
whether conflicting virtual transactions are still running or not.

Also LogRecoveryConflict() logs the recovery conflict even if it
finds that there are no conflicting active backends. So the rule
about whether to log the conflict when there are no conflicting
backends should be made consistent between functions, I think.
Thought?

Also I added more comments.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment Content-Type Size
v10-0008-Log-the-standby-recovery-conflict-waits.patch text/plain 19.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2020-12-03 18:14:56 Re: Minor documentation error regarding streaming replication protocol
Previous Message Jeff Davis 2020-12-03 17:04:21 Re: Minor documentation error regarding streaming replication protocol