Re: Race condition in recovery?

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: robertmhaas(at)gmail(dot)com
Cc: dilipbalaut(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Race condition in recovery?
Date: 2021-05-14 05:24:30
Message-ID: 20210514.142430.450479701172804252.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Fri, 14 May 2021 14:12:31 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> At Thu, 13 May 2021 17:07:31 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in
> > On Mon, May 10, 2021 at 4:35 AM Kyotaro Horiguchi
> > <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > > It seems to me the issue here is not a race condition but
> > > WaitForWALToBecomeAvailable initializing expectedTLEs with the history
> > > of a improper timeline. So using recoveryTargetTLI instead of
> > > receiveTLI for the case fixes this issue.
> >
> > I agree.
> >
> > > I believe the 004_timeline_switch.pl detects your issue. And the
> > > attached change fixes it.
> >
> > So why does this use recoveryTargetTLI instead of receiveTLI only
> > conditionally? Why not do it all the time?
>
> The commit ee994272ca apparently says that there's a case where primary

This is not an incomplete line but just a garbage.

> > The hard thing about this code is that the assumptions are not very
> > clear. If we don't know why something is a certain way, then we might
> > break things if we change it. Worse yet, if nobody else knows why it's
> > like that either, then who knows what assumptions they might be
> > making? It's hard to be sure that any change is safe.
>
> Thanks for the comment.
>
> > But that being said, we have a clear definition from the comments for
> > what expectedTLEs is supposed to contain, and it's only going to end
> > up with those contents if we initialize it from recoveryTargetTLI. So
> > I am inclined to think that we ought to do that always, and if it
>
> Yes, I also found it after that, and agreed. Desynchronization
> between recoveryTargetTLI and expectedTLEs prevents
> rescanLatestTimeline from working.
>
> > breaks something, then that's a sign that some other part of the code
> > also needs fixing, because apparently that hypothetical other part of
> > the code doesn't work if expctedTLEs contains what the comments say
> > that it should.
>
> After some more inspection, I'm now also sure that it is a
> typo/thinko. Other than while fetching the first checkpoint,
> receivedTLI is always in the history of recoveryTargetTLI, otherwise
> recoveryTargetTLI is equal to receiveTLI.
>
> I read that the commit message as "waiting for fetching possible
> future history files to know if there's the future for the current
> timeline. However now I read it as "don't bother expecting for
> possiblly-unavailable history files when we're reading the first
> checkpoint the timeline for which is already known to us.". If it is
> correct we don't bother considering future history files coming from
> primary there.
>
> > Now maybe that's the wrong idea. But if so, then we're saying that the
> > definition of expectedTLEs needs to be changed, and we should update
> > the comments with the new definition, whatever it is. A lot of the
> > confusion here results from the fact that the code and comments are
> > inconsistent and we can't tell whether that's intentional or
> > inadvertent. Let's not leave the next person who looks at this code
> > wondering the same thing about whatever changes we make.
>
> Ok. The reason why we haven't have a complain about that would be
> that it is rare that pg_wal is wiped out before a standby connects to
> a just-promoted primary. I'm not sure about the tool Dilip is using,
> though..
>
> So the result is the attached. This would be back-patcheable to 9.3
> (or 9.6?) but I doubt that we should do as we don't seem to have had a
> complaint on this issue and we're not full faith on this.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2021-05-14 05:27:34 Re: OOM in spgist insert
Previous Message Dilip Kumar 2021-05-14 05:13:20 Re: parallel vacuum - few questions on docs, comments and code