Re: Race condition in recovery?

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: robertmhaas(at)gmail(dot)com
Cc: dilipbalaut(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Race condition in recovery?
Date: 2021-05-14 05:12:31
Message-ID: 20210514.141231.1686827794964357507.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Thu, 13 May 2021 17:07:31 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in
> On Mon, May 10, 2021 at 4:35 AM Kyotaro Horiguchi
> <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > It seems to me the issue here is not a race condition but
> > WaitForWALToBecomeAvailable initializing expectedTLEs with the history
> > of a improper timeline. So using recoveryTargetTLI instead of
> > receiveTLI for the case fixes this issue.
>
> I agree.
>
> > I believe the 004_timeline_switch.pl detects your issue. And the
> > attached change fixes it.
>
> So why does this use recoveryTargetTLI instead of receiveTLI only
> conditionally? Why not do it all the time?

The commit ee994272ca apparently says that there's a case where primary

> The hard thing about this code is that the assumptions are not very
> clear. If we don't know why something is a certain way, then we might
> break things if we change it. Worse yet, if nobody else knows why it's
> like that either, then who knows what assumptions they might be
> making? It's hard to be sure that any change is safe.

Thanks for the comment.

> But that being said, we have a clear definition from the comments for
> what expectedTLEs is supposed to contain, and it's only going to end
> up with those contents if we initialize it from recoveryTargetTLI. So
> I am inclined to think that we ought to do that always, and if it

Yes, I also found it after that, and agreed. Desynchronization
between recoveryTargetTLI and expectedTLEs prevents
rescanLatestTimeline from working.

> breaks something, then that's a sign that some other part of the code
> also needs fixing, because apparently that hypothetical other part of
> the code doesn't work if expctedTLEs contains what the comments say
> that it should.

After some more inspection, I'm now also sure that it is a
typo/thinko. Other than while fetching the first checkpoint,
receivedTLI is always in the history of recoveryTargetTLI, otherwise
recoveryTargetTLI is equal to receiveTLI.

I read that the commit message as "waiting for fetching possible
future history files to know if there's the future for the current
timeline. However now I read it as "don't bother expecting for
possiblly-unavailable history files when we're reading the first
checkpoint the timeline for which is already known to us.". If it is
correct we don't bother considering future history files coming from
primary there.

> Now maybe that's the wrong idea. But if so, then we're saying that the
> definition of expectedTLEs needs to be changed, and we should update
> the comments with the new definition, whatever it is. A lot of the
> confusion here results from the fact that the code and comments are
> inconsistent and we can't tell whether that's intentional or
> inadvertent. Let's not leave the next person who looks at this code
> wondering the same thing about whatever changes we make.

Ok. The reason why we haven't have a complain about that would be
that it is rare that pg_wal is wiped out before a standby connects to
a just-promoted primary. I'm not sure about the tool Dilip is using,
though..

So the result is the attached. This would be back-patcheable to 9.3
(or 9.6?) but I doubt that we should do as we don't seem to have had a
complaint on this issue and we're not full faith on this.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v2-0001-Set-expectedTLEs-correctly-based-on-recoveryTarge.patch text/x-patch 4.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2021-05-14 05:13:20 Re: parallel vacuum - few questions on docs, comments and code
Previous Message Dilip Kumar 2021-05-14 04:59:07 Re: Race condition in recovery?