Re: Is Recovery actually paused?

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: dilipbalaut(at)gmail(dot)com
Cc: nagata(at)sraoss(dot)co(dot)jp, bharath(dot)rupireddyforpostgres(at)gmail(dot)com, sawada(dot)mshk(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, robertmhaas(at)gmail(dot)com, simon(at)2ndquadrant(dot)com
Subject: Re: Is Recovery actually paused?
Date: 2021-02-09 01:58:04
Message-ID: 20210209.105804.245840302061999932.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Mon, 8 Feb 2021 17:05:52 +0530, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote in
> On Mon, Feb 8, 2021 at 2:19 PM Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp> wrote:
> >
> > On Mon, 08 Feb 2021 17:32:46 +0900 (JST)
> > Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> >
> > > At Mon, 8 Feb 2021 14:12:35 +0900, Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp> wrote in
> > > > > > > I think the right fix should be that the state should never go from
> > > > > > > ‘paused’ to ‘pause requested’ so I think pg_wal_replay_pause should take
> > > > > > > care of that.
> > > > > >
> > > > > > It makes sense to take care of this in pg_wal_replay_pause, but I suspect
> > > > > > it cannot handle the case where a user resumes and pauses again during a sleep.
> > > > >
> > > > > Right, we will have to check and set in the loop. But we should not
> > > > > allow the state to go from paused to pause requested irrespective of
> > > > > this.
> > > >
> > > > I agree with you.
> > >
> > > Is there any actual harm if PAUSED returns to REQUESTED, assuming we
> > > always change the state to PAUSED immediately when we see REQUESTED in
> > > the waiting loop, given that we already allow the state to change from
> > > PAUSED to REQUESTED via NOT_PAUSED between two successive loop
> > > condition checks?
> >
> > If a user calls pg_wal_replay_pause while recovery is paused, users can
> > observe 'pause requested' during a sleep, although the time window is short.
> > It seems a bit odd that pg_wal_replay_pause changes the state like this,
> > because this state means that recovery may not be 'paused'.
>
> Yeah, this appears wrong that after 'paused' we go back to 'pause
> requested'. The logical state transition should always be as below
>
> NOT PAUSED -> PAUSE REQUESTED or PAUSED (maybe we should always go to
> request and then paused but there is nothing wrong with going to
> paused)
> PAUSE REQUESTED -> NOT PAUSED or PAUSED (either cancel the request or get paused)
> PAUSED -> NOT PAUSED (from PAUSED we should not go to the
> PAUSE_REQUESTED without going to NOT PAUSED)

I didn't ask about the internal logical correctness, but about the
*actual harm* revealed to users. I don't see any actual harm in the
"wrong" transition because:

1. It is neither wrong nor strange that the invoker of pg_wal_replay_pause
sees the state PAUSE_REQUESTED before it changes to PAUSED. Even if
the previous state was PAUSED, that is no business of the requestor.

2. There is no harm on the recovery side, since PAUSE_REQUESTED and PAUSED
are effectively the same state.

3. Even if we inhibit the direct transition
PAUSED->PAUSE_REQUESTED, the effectively equivalent transition
PAUSED->NOT_PAUSED->PAUSE_REQUESTED is still allowed. Inhibiting
the former transition protects nothing other than the apparent
correctness of the transitions.
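
Point 3 can be checked mechanically. The following is a minimal sketch,
not code from any patch in this thread: the enum and function names
(DemoPauseState, demo_direct_allowed, demo_reachable_in_two) are
hypothetical, and the rule encoded is the transition table quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; they only model the three states discussed above. */
typedef enum DemoPauseState
{
    DEMO_NOT_PAUSED,
    DEMO_PAUSE_REQUESTED,
    DEMO_PAUSED
} DemoPauseState;

/*
 * Direct transitions under the stricter rule: every change of state is
 * allowed except PAUSED -> PAUSE_REQUESTED.
 */
bool
demo_direct_allowed(DemoPauseState from, DemoPauseState to)
{
    if (from == to)
        return false;           /* not a transition */
    return !(from == DEMO_PAUSED && to == DEMO_PAUSE_REQUESTED);
}

/* True if "to" is reachable from "from" in one or two allowed steps. */
bool
demo_reachable_in_two(DemoPauseState from, DemoPauseState to)
{
    if (demo_direct_allowed(from, to))
        return true;
    for (int mid = DEMO_NOT_PAUSED; mid <= DEMO_PAUSED; mid++)
    {
        if (demo_direct_allowed(from, (DemoPauseState) mid) &&
            demo_direct_allowed((DemoPauseState) mid, to))
            return true;
    }
    return false;
}

/* Usage: the forbidden direct step is still reachable via NOT_PAUSED. */
void
demo_self_check(void)
{
    assert(!demo_direct_allowed(DEMO_PAUSED, DEMO_PAUSE_REQUESTED));
    assert(demo_reachable_in_two(DEMO_PAUSED, DEMO_PAUSE_REQUESTED));
}
```

An observer that samples the state only occasionally cannot tell the
two-step path from the direct one.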

If we are going to introduce that complexity, I'd like to re-propose
to introduce interlocking between the recovery side and the
pause-requestor side instead of introducing the intermediate state,
which is the cause of the complexity.

The problem is due to the looseness of checking for pause requests at
the existing check points, and the window between the last check point
and the call to rm_redo().

The attached PoC patch adds:

- A solid check point just before calling rm_redo. It doesn't add an
info_lck acquisition, since the check is done in the existing lock section.

- Interlocking between the above and SetRecoveryPause without adding a
shared variable.
(This is what I called "synchronous" before.)
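
To illustrate the idea of checking inside the existing lock section,
here is a minimal sketch. It is not the attached patch: all names
(DemoCtl, demo_set_pause, demo_begin_replay) are hypothetical, and a C11
atomic_flag stands in for the info_lck spinlock. The sketch checks the
flag before advancing the replay position, which is an assumption of the
sketch and not something stated in this mail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared recovery control structure. */
typedef struct DemoCtl
{
    atomic_flag info_lck;        /* models the spinlock */
    bool        pause_requested; /* set by the requestor side */
    uint64_t    replay_end;      /* end of the record being replayed */
} DemoCtl;

#define DEMO_CTL_INIT { ATOMIC_FLAG_INIT, false, 0 }

static void
demo_lock(DemoCtl *ctl)
{
    while (atomic_flag_test_and_set_explicit(&ctl->info_lck,
                                             memory_order_acquire))
        ;                       /* spin until the lock is free */
}

static void
demo_unlock(DemoCtl *ctl)
{
    atomic_flag_clear_explicit(&ctl->info_lck, memory_order_release);
}

/* Requestor side: set or clear the pause request under the lock. */
void
demo_set_pause(DemoCtl *ctl, bool pause)
{
    demo_lock(ctl);
    ctl->pause_requested = pause;
    demo_unlock(ctl);
}

/*
 * Recovery side: called just before redo of a record. The pause check
 * shares the lock section that advances the replay position, so no extra
 * lock acquisition is needed. Returns false if the caller must pause
 * instead of replaying the record.
 */
bool
demo_begin_replay(DemoCtl *ctl, uint64_t record_end)
{
    bool        ok;

    demo_lock(ctl);
    ok = !ctl->pause_requested;
    if (ok)
        ctl->replay_end = record_end;
    demo_unlock(ctl);
    return ok;
}

/* Usage: once demo_set_pause returns, no new record starts replaying. */
void
demo_interlock_check(void)
{
    DemoCtl     ctl = DEMO_CTL_INIT;

    assert(demo_begin_replay(&ctl, 100));
    demo_set_pause(&ctl, true);
    assert(!demo_begin_replay(&ctl, 200));
    assert(ctl.replay_end == 100);
}
```

Because both sides take the same lock, a record whose replay begins after
demo_set_pause returns always sees the request. A record already in redo
at that moment is not covered by this sketch.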

There's a concern about pausing after updating
XLogCtl->replayEndRecPtr, but I don't see an issue yet.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
make_wal_replay_pause_synchronous.patch text/x-patch 2.6 KB
