Re: Race condition in recovery?

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race condition in recovery?
Date: 2021-05-18 05:33:30
Message-ID: CAFiTN-t5L5PYbZW5K1OcdK0qcfgvi570KBTjReZPwYgVz4A56g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, May 18, 2021 at 1:28 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Sorry, you're right. It couldn't be uninitialized, but it could be a
> fake 1-element list saying there are no ancestors rather than the real
> value. So I think the point was to avoid that.

Yeah, it will be a fake 1-element list. But just to be clear that
1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and
nothing else, do you agree to this? Because we initialize
recoveryTargetTLI to this value and we might change it in
readRecoveryCommandFile() but for that, we must get the history file,
so if we are talking about the case where we don't have the history
file then "recoveryTargetTLI" will still be
"ControlFile->checkPointCopy.ThisTimeLineID".

>
> I think the issue here is: If expectedTLEs was initialized before the
> history file was available, and contained a dummy 1-element list, then
> tliOfPointInHistory() is going to say that every LSN is on that
> timeline rather than any previous timeline. And if we are supposed to
> be switching timelines then that will lead to this sanity check
> failing.

You are talking about the sanity check of validating the timeline of
the checkpoint record right? but as I mentioned earlier the only
entry in expectedTLEs will be the TLE of the checkpoint record so how
the sanity check will fail?

>
> I agree, but that's actually bad, isn't it?

Yes, it is bad.

I mean if we want the
> sanity check to never fail we can just take it out. What we want to
> happen is that the sanity check should pass if the startup timeline if
> the TLI of the startup checkpoint is in the history of the recovery
> target timeline, but fail if it isn't. The only way to achieve that
> behavior is if expectedTLEs is initialized from the recovery target
> timeline.

Yes, I agree, with this. So initializing expectedTLEs with the
recovery target timeline is the right fix.

Conclusion:
- I think now we agree on the point that initializing expectedTLEs
with the recovery target timeline is the right fix.
- We still have some differences of opinion about what was the
original problem in the base code which was fixed by the commit
(ee994272ca50f70b53074f0febaec97e28f83c4e).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2021-05-18 05:42:25 Re: Testing autovacuum wraparound (including failsafe)
Previous Message Masahiko Sawada 2021-05-18 05:28:53 Re: Testing autovacuum wraparound (including failsafe)