Re: Race condition in recovery?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race condition in recovery?
Date: 2021-05-14 22:28:12
Message-ID: CA+TgmoY8=PtmyDWLP3gG59pifbtp3gwKfKYz1FpfqmKF25PCqQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, May 14, 2021 at 12:59 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I am not sure that have you noticed the commit id which changed the
> definition of expectedTLEs, Heikki has committed that change so adding
> him in the list to know his opinion.

I did notice, but keep in mind that this was more than 8 years ago.
Even if Heikki is reading this thread, he may not remember why he
changed 1 line of code one way rather than another in 2013. I mean if
he does that's great, but it's asking a lot.

> =====
> ee994272ca50f70b53074f0febaec97e28f83c4e
> Author: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi> 2013-01-03 14:11:58
> Committer: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi> 2013-01-03 14:11:58
>
> Delay reading timeline history file until it's fetched from master.
> .....
> Without the timeline history file, recovering that file
> will fail as the older timeline ID is not recognized to be an ancestor of
> the target timeline. If you try to recover from such a backup, using only
> streaming replication to fetch the WAL, this patch is required for that to
> work.
> =====
>
> Part of this commit message says that it will not identify the
> recoveryTargetTLI as the ancestor of the checkpoint timeline (without
> history file). I did not understand what it is trying to say. Does
> it is trying to say that even though the recoveryTargetTLI is the
> ancestor of the checkpoint timeline but we can not track that because
> we don't have a history file? So to handle this problem change the
> definition of expectedTLEs to directly point to the checkpoint
> timeline?
>
> Because before this commit, we were directly initializing expectedTLEs
> with the history file of recoveryTargetTLI, we were not even waiting
> for reading the checkpoint, but under this commit, it is changed.

Well, I think that is talking about what the commit did in general,
not specifically the one line of code that we think may be incorrect.
As I understand it, the general issue here was that if
XLogFileReadAnyTLI() was called before expectedTLEs got set, then
prior to this commit it would have to fail, because the foreach() loop
in that function would be iterating over an empty list. Heikki tried
to make it not fail in that case, by setting tles =
readTimeLineHistory(recoveryTargetTLI), so that the foreach loop
*wouldn't* get an empty list.

Thinking about this a bit more, I think the idea behind the logic this
commit added to XLogFileReadAnyTLI() is that
XLogFileReadAnyTLI(recoveryTargetTLI) may or may not produce the
correct answer. If the timeline history file exists, it will contain
all the information that we need and will return a complete list of
TLEs. But if the file does not exist yet, then it will return a
1-entry list saying that the TLI in question has no parents. If
readTimeLineHistory() actually reads the file, then it's safe to save
the return value in expectedTLEs, but if it doesn't, then it may or
may not be safe. If XLogFileReadAnyTLI calls XLogFileRead and it
works, then the WAL segment we need exists on our target timeline and
we don't actually need the timeline history for anything because we
can just directly begin replay from the target timeline. But if
XLogFileRead fails with the 1-entry dummy list, then we need the
timeline history and don't have it yet, so we have to retry later,
when the history file will hopefully be present, and then at that
point readTimeLineHistory will return a different and better answer
and hopefully it will all work.

I think this is what the commit message is talking about when it says
that "Without the timeline history file, recovering that file will
fail as the older timeline ID is not recognized to be an ancestor of
the target timeline." Without the timeline history file, we can't know
whether some other timeline is an ancestor or not. But the specific
way that manifests is that XLogFileReadAnyTLI() returns a 1-entry
dummy list instead of the real contents of the timeline history file.
This commit doesn't prevent that from happening, but it does prevent
the 1-entry dummy list from getting stored in the global variable
expectedTLEs, except in the case where no timeline switch is occurring
and the lack of history therefore doesn't matter. Without this commit,
if the call to readTimeLineHistory(recoveryTargetTLI) happens at a
time when the timeline history file is not yet available, the 1-entry
dummy list ends up in the global variable and there's no way for it to
ever be replaced with a real history even if the timeline history file
shows up in the archive later.

As I see it, the question on the table here is whether there's any
justification for the fact that when the second switch in
WaitForWALToBecomeAvailable takes the
XLOG_FROM_ARCHIVE/XLOG_FROM_PG_WAL branch, it calls XLogFileReadAnyTLI
which tries to read the history of recoveryTargetTLI, while when that
same switch takes the XLOG_FROM_STREAM branch, it tries to read the
history of receiveTLI. I tend to think it doesn't make sense. On
general principle, archiving and streaming are supposed to work the
same way, so the idea that they are getting the timeline from
different places is inherently suspect. But also and more
specifically, AFAICS receiveTLI always has to be the same TLI that we
requested from the server, so we're always looking up our own current
TLI here rather than the target TLI, which seems wrong to me, at least
of this moment. :-)

But that having been said, I still don't quite understand the
conditions required to tickle this problem. I spent a long time poking
at it today. It seems to me that it ought somehow to be possible to
recreate the scenario without trying to reuse the old master as a
standby, and also without even needing a WAL archive, but I couldn't
figure out how to do it. I tried setting up a primary and a standby,
and then making a backup from the standby, promoting it, and then
starting what would have been a cascading standby from the backup. But
that doesn't do it. The first mistake I made was creating the standbys
with something like 'pg_basebackup -R', but that's not good enough
because then they have WAL, so I added '-Xnone'. But then I realized
that when a base backup ends, the primary writes an XLOG_SWITCH
record, which means that when the standby is promoted, the promotion
is not in the same WAL segment as the checkpoint from which the
machine that would have been a cascading standby is trying to start. I
worked around that by setting recovery_target='immediate' on the
standby. With that change, I get a WAL file on the new timeline - 2 in
this case - that looks like this:

rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn:
0/19000060, prev 0/19000028, desc: CHECKPOINT_ONLINE redo 0/19000028;
tli 1; prev tli 1; fpw true; xid 0:587; oid 16385; multi 1; offset 0;
oldest xid 579 in DB 1; oldest multi 1 in DB 1; oldest/newest commit
timestamp xid: 0/0; oldest running xid 587; online
rmgr: XLOG len (rec/tot): 34/ 34, tx: 0, lsn:
0/190000D8, prev 0/19000060, desc: BACKUP_END 0/19000028
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn:
0/19000100, prev 0/190000D8, desc: CHECKPOINT_SHUTDOWN redo
0/19000100; tli 2; prev tli 1; fpw true; xid 0:587; oid 16385; multi
1; offset 0; oldest xid 579 in DB 1; oldest multi 1 in DB 1;
oldest/newest commit timestamp xid: 0/0; oldest running xid 0;
shutdown

That sure looks like the right thing to recreate the problem, because
the first checkpoint is from the backup, and the
woulda-been-a-cascading-standby should be starting there, and the
second checkpoint is in the same segment and shows a timeline switch.
But everything worked great:

2021-05-14 17:44:58.684 EDT [5697] DETAIL: End of WAL reached on
timeline 1 at 0/19000100.
2021-05-14 17:44:58.728 EDT [5694] LOG: new target timeline is 2
2021-05-14 17:44:58.746 EDT [5694] LOG: redo starts at 0/19000028
2021-05-14 17:44:58.749 EDT [5694] LOG: consistent recovery state
reached at 0/19000100

I don't understand why that works. It feels to me like it ought to run
smack into the same problem you saw, but it doesn't.

> I am referring to the below code which was deleted by this commit:
>
> ========
> @@ -5279,49 +5299,6 @@ StartupXLOG(void)
> */
> readRecoveryCommandFile();
>
> - /* Now we can determine the list of expected TLIs */
> - expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
> -
> - /*
> - * If the location of the checkpoint record is not on the expected
> - * timeline in the history of the requested timeline, we cannot proceed:
> - * the backup is not part of the history of the requested timeline.
> - */
> - if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) !=
> - ControlFile->checkPointCopy.ThisTimeLineID)
> - {
> =========

I don't think this code is really deleted. The tliOfPointInHistory
check was just moved later in the function. And expectedTLEs is still
supposed to be getting initialized, because just before the new
location of the tliOfPointInHistory check, Heikki added
Assert(expectedTLEs).

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-05-14 22:32:44 Re: Possible memory corruption (src/timezone/zic.c b/src/timezone/zic.c)
Previous Message Ranier Vilela 2021-05-14 21:43:24 Possible memory corruption (src/timezone/zic.c b/src/timezone/zic.c)