Re: Some problems of recovery conflict wait events

From: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Some problems of recovery conflict wait events
Date: 2020-03-12 06:12:44
Message-ID: CA+fd4k4fbwpb-T_uR-CRHBFLsEGUqFbykzmF8YM6H8-wPVADAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 11 Mar 2020 at 16:42, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>
>
>
> On 2020/03/10 13:54, Masahiko Sawada wrote:
> > On Tue, 10 Mar 2020 at 00:57, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> >>
> >>
> >>
> >> On 2020/03/09 14:10, Masahiko Sawada wrote:
> >>> On Mon, 9 Mar 2020 at 13:24, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/03/08 13:52, Masahiko Sawada wrote:
> >>>>> On Thu, 5 Mar 2020 at 20:16, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 2020/03/05 16:58, Masahiko Sawada wrote:
> >>>>>>> On Wed, 4 Mar 2020 at 15:21, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 2020/03/04 14:31, Masahiko Sawada wrote:
> >>>>>>>>> On Wed, 4 Mar 2020 at 13:48, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 2020/03/04 13:27, Michael Paquier wrote:
> >>>>>>>>>>> On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote:
> >>>>>>>>>>>> Yeah, so 0001 patch sets existing wait events to recovery conflict
> >>>>>>>>>>>> resolution. For instance, it sets (PG_WAIT_LOCK | LOCKTAG_TRANSACTION)
> >>>>>>>>>>>> to the recovery conflict on a snapshot. 0003 patch improves these wait
> >>>>>>>>>>>> events by adding the new type of wait event such as
> >>>>>>>>>>>> WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore 0001 (and 0002) patch
> >>>>>>>>>>>> is the fix for existing versions and 0003 patch is an improvement for
> >>>>>>>>>>>> only PG13. Did you mean even 0001 patch doesn't fit for back-patching?
> >>>>>>>>>>
> >>>>>>>>>> Yes, it looks like a improvement rather than bug fix.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Okay, understand.
> >>>>>>>>>
> >>>>>>>>>>> I got my eyes on this patch set. The full patch set is in my opinion
> >>>>>>>>>>> just a set of improvements, and not bug fixes, so I would refrain from
> >>>>>>>>>>> back-backpatching.
> >>>>>>>>>>
> >>>>>>>>>> I think that the issue (i.e., "waiting" is reported twice needlessly
> >>>>>>>>>> in PS display) that 0002 patch tries to fix is a bug. So it should be
> >>>>>>>>>> fixed even in the back branches.
> >>>>>>>>>
> >>>>>>>>> So we need only two patches: one fixes process title issue and another
> >>>>>>>>> improve wait event. I've attached updated patches.
> >>>>>>>>
> >>>>>>>> Thanks for updating the patches! I started reading 0001 patch.
> >>>>>>>
> >>>>>>> Thank you for reviewing this patch.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> - /*
> >>>>>>>> - * Report via ps if we have been waiting for more than 500 msec
> >>>>>>>> - * (should that be configurable?)
> >>>>>>>> - */
> >>>>>>>> - if (update_process_title && new_status == NULL &&
> >>>>>>>> - TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
> >>>>>>>> - 500))
> >>>>>>>>
> >>>>>>>> The patch changes ResolveRecoveryConflictWithSnapshot() and
> >>>>>>>> ResolveRecoveryConflictWithTablespace() so that they always add
> >>>>>>>> "waiting" into the PS display, whether wait is really necessary or not.
> >>>>>>>> But isn't it better to display "waiting" in PS basically when wait is
> >>>>>>>> necessary, like originally ResolveRecoveryConflictWithVirtualXIDs()
> >>>>>>>> does as the above?
> >>>>>>>
> >>>>>>> You're right. Will fix it.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> ResolveRecoveryConflictWithDatabase(Oid dbid)
> >>>>>>>> {
> >>>>>>>> + char *new_status = NULL;
> >>>>>>>> +
> >>>>>>>> + /* Report via ps we are waiting */
> >>>>>>>> + new_status = set_process_title_waiting();
> >>>>>>>>
> >>>>>>>> In ResolveRecoveryConflictWithDatabase(), there seems no need to
> >>>>>>>> display "waiting" in PS because no wait occurs when recovery conflict
> >>>>>>>> with database happens.
> >>>>>>>
> >>>>>>> Isn't the startup process waiting for other backend to terminate?
> >>>>>>
> >>>>>> Yeah, you're right. I agree that "waiting" should be reported in this case.
> >>>>>>
> >>>>>> Currently ResolveRecoveryConflictWithLock() and
> >>>>>> ResolveRecoveryConflictWithBufferPin() don't call
> >>>>>> ResolveRecoveryConflictWithVirtualXIDs and don't report "waiting"
> >>>>>> in PS display. You changed them so that they report "waiting". I agree
> >>>>>> to have this change. But this change is an improvement rather than
> >>>>>> a bug fix, i.e., we should apply this change only in v13?
> >>>>>>
> >>>>>
> >>>>> Did you mean ResolveRecoveryConflictWithDatabase and
> >>>>> ResolveRecoveryConflictWithBufferPin?
> >>>>
> >>>> Yes! Sorry for my typo.
> >>>>
> >>>>> In the current code as far as I
> >>>>> researched there are two cases where we don't add "waiting" and one
> >>>>> case where we doubly add "waiting".
> >>>>>
> >>>>> ResolveRecoveryConflictWithDatabase and
> >>>>> ResolveRecoveryConflictWithBufferPin don't update the ps title.
> >>>>> Although the path where GetCurrentTimestamp() >= ltime is false in
> >>>>> ResolveRecoveryConflictWithLock also doesn't update the ps title, it's
> >>>>> already updated in WaitOnLock. On the other hand, the path where
> >>>>> GetCurrentTimestamp() >= ltime is true in
> >>>>> ResolveRecoveryConflictWithLock updates the ps title but it's wrong as
> >>>>> I reported.
> >>>>>
> >>>>> I've split the patch into two patches: 0001 patch fixes the issue
> >>>>> about doubly updating ps title, 0002 patch makes the recovery conflict
> >>>>> resolution on database and buffer pin update the ps title.
> >>>>
> >>>> Thanks for splitting the patches. I think that 0001 patch can be back-patched.
> >>>>
> >>>> - /*
> >>>> - * Report via ps if we have been waiting for more than 500 msec
> >>>> - * (should that be configurable?)
> >>>> - */
> >>>> - if (update_process_title && new_status == NULL &&
> >>>> - TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
> >>>> - 500))
> >>>>
> >>>> Originally, "waiting" is reported in PS if we've been waiting for more than
> >>>> 500 msec, as the above does. But you got rid of those codes in the patch.
> >>>> Did you confirm that it's safe to do that? If not, isn't it better to apply
> >>>> the attached patch?
> >>>
> >>> In WaitOnLock() we update the ps title regardless of waiting time. So
> >>> I thought we can change it to make these behavior consistent. But
> >>> considering back-patch, your patch looks better than mine.
> >>
> >> Yeah, so I pushed the 0001 patch at first!
> >> I will review the other patches later.
> >
> > Thank you!
> >
> > For 0002 patch which makes ResolveRecoveryConflictWithDatabase and
> > ResolveRecoveryConflictWithBufferPin update the ps title, I think
> > these are better to wait for 5ms before updating the ps title like
> > ResolveRecoveryConflictWithVirtualXIDs, for consistency among recovery
> > conflict resolution functions, but what do you think?
>
> Maybe yes.
>
> As another idea, for consistency, we can change all
> ResolveRecoveryConflictWithXXX() so that they don't wait
> at all before reporting "waiting". But if we don't do that,
> "waiting" can be reported even when we can immediately
> cancel or terminate the conflicting transactions (e.g., in
> case of max_standby_streaming_delay=0). To avoid this
> issue, I think it's better to wait for 500ms.

Agreed.

>
> The 0002 patch changes ResolveRecoveryConflictWithBufferPin()
> so that it updates PS every time. But this seems not good
> because the update can happen very frequently. Thought?

Agreed. In the updated version patch, I update the process title in
LockBufferForCleanup() only once when we've been waiting for more than
500 ms. This change also affects the primary server that is waiting
for buffer cleanup lock. I think it would not be bad but it's
different behaviour from LockBuffer().

I've attached the updated version patch. Please review it.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
update_process_title_v4.patch application/octet-stream 5.5 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2020-03-12 06:14:12 Re: Berserk Autovacuum (let's save next Mandrill)
Previous Message Craig Ringer 2020-03-12 06:08:31 Re: [PATCH] Skip llvm bytecode generation if LLVM is missing