Re: Race condition in recovery?

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: robertmhaas(at)gmail(dot)com
Cc: dilipbalaut(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Race condition in recovery?
Date: 2021-05-24 02:34:02
Message-ID: 20210524.113402.1922481024406047229.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Fri, 21 May 2021 12:52:54 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in
> I had trouble following it completely, but I didn't really spot
> anything that seemed definitely wrong. However, I don't understand
> what it has to do with where we are now. What I want to understand is:
> under exactly what circumstances does it matter that
> WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
> will stream from receiveTLI rather than recoveryTargetTLI?

Extracting the related descriptions from my previous mail:

- recoveryTargetTimeLine is initialized with
ControlFile->checkPointCopy.ThisTimeLineID

- readRecoveryCommandFile():
...or in the case of
latest, move it forward up to the maximum timeline among the history
files found in either pg_wal or archive.

- ReadRecord...XLogFileReadAnyTLI

Tries to load the history file for recoveryTargetTLI from either
pg_wal or the archive into a local TLE list. If the history file is
not found, uses a generated list with a single entry for
recoveryTargetTLI.

(b) If such a segment is *not* found, expectedTLEs is left
NIL. Usually recoveryTargetTLI is equal to the last checkpoint
TLI.

(c) However, if a timeline switch happened within the segment and
recoveryTargetTLI has been advanced (that is, the history file for
recoveryTargetTLI was found in pg_wal or the archive), which is the
issue raised here, recoveryTargetTLI becomes a future timeline of
the checkpoint TLI.

- WaitForWALToBecomeAvailable

In the case of (c), recoveryTargetTLI > checkpoint TLI. In this case
we expect the checkpoint TLI to be in the history of
recoveryTargetTLI; otherwise recovery fails. This case is similar
to case (a), but the relationship between recoveryTargetTLI and
the checkpoint TLI has not been confirmed yet. ReadRecord barks later
if they are not compatible, so this is not a serious problem, but it
might be better to check the relationship there. My first proposal
checked the two against each other in both directions, but we only
need to check in one direction.
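
The one-directional check described above can be sketched as a toy
model. This is Python rather than the actual C code, and the function
names and the plain-list representation of a timeline history are
assumptions made for illustration only (the real code keeps a list of
TimeLineHistoryEntry structs):

```python
def tli_in_history(tli, history):
    """Return True if `tli` appears in a timeline history.

    `history` is a plain list of timeline IDs, newest first. This is a
    simplification of the real expectedTLEs list.
    """
    return tli in history


def checkpoint_is_ancestor(checkpoint_tli, target_history):
    """One-directional check: the checkpoint TLI must be part of the
    history of the recovery target timeline. Nothing is checked in the
    other direction."""
    return tli_in_history(checkpoint_tli, target_history)


# Case (c): the target timeline is 2 and its history says it forked from 1.
history_of_target = [2, 1]

# A checkpoint on timeline 1 is compatible with target timeline 2.
compatible = checkpoint_is_ancestor(1, history_of_target)

# A checkpoint on an unrelated timeline 3 is not.
incompatible = checkpoint_is_ancestor(3, history_of_target)
```

With these inputs, `compatible` is true and `incompatible` is false;
the latter is the situation where recovery has to fail.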

===
So the conditions for Dilip's case are, as you wrote in another mail:

- ControlFile->checkPointCopy.ThisTimeLineID is on the older timeline.
- The archive or pg_wal offers the history file for the newer timeline.
- The segment for the checkpoint is found neither in pg_wal nor in the archive.

That is,

- The grandchild node (c) is stopped.
- Then the child node (b) is promoted.

- Clear the pg_wal directory of (c), then connect it to (b) *before*
(b) archives the newer-timeline copy of the timeline-switching
segment. (If we switched at segment 3 on TLI=1, the segment file of
the older timeline is renamed to .partial, and the same segment is
then created for TLI=2. The former is archived while the promotion
is performed, but the latter won't be archived until the segment
ends.)
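
To make the file names in that example concrete, here is a small
sketch of how the two segment files would be named. It assumes the
default 16MB WAL segment size and follows the usual
timeline/log/segment naming of WAL files in 8-digit hex fields; the
helper name wal_file_name is made up for this illustration.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024          # assumed default segment size
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE


def wal_file_name(tli, segno):
    """Build a WAL segment file name from a timeline ID and segment number."""
    return "%08X%08X%08X" % (
        tli,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


# Timeline switch in segment 3, from TLI=1 to TLI=2.
old_partial = wal_file_name(1, 3) + ".partial"   # archived at promotion
new_segment = wal_file_name(2, 3)                # archived only when it fills up

print(old_partial)   # 000000010000000000000003.partial
print(new_segment)   # 000000020000000000000003
```

Until the second file is archived, a standby with an empty pg_wal can
see 00000002.history but no complete copy of segment 3.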

The original case, after commit ee994272ca:

- recoveryTargetTimeLine is initialized with
ControlFile->checkPointCopy.ThisTimeLineID

(X) (Before the commit, we created a one-entry expectedTLEs consisting
only of ControlFile->checkPointCopy.ThisTimeLineID.)

- readRecoveryCommandFile():

Move recoveryTargetTLI forward to the specified target timeline if
the history file for that timeline is found, or in the case of
latest, move it forward up to the maximum timeline among the history
files found in either pg_wal or the archive.

- ReadRecord...XLogFileReadAnyTLI

Tries to load the history file for recoveryTargetTLI from either
pg_wal or the archive into a local TLE list. If the history file is
not found, uses a generated list with a single entry for
recoveryTargetTLI.

(b) If such a segment is *not* found, expectedTLEs is left
NIL. Usually recoveryTargetTLI is equal to the last checkpoint
TLI.

- WaitForWALToBecomeAvailable

If we have no segments for the last checkpoint, initiate streaming
from the REDO point of the last checkpoint. We should have all the
history files by the time segment data is received.

After sufficient WAL data has been received, the only cases where
expectedTLEs is still NIL are (b) and (c) above.

In the case of (b) recoveryTargetTLI == checkpoint TLI.
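
The target selection and history loading steps listed above can be
modeled roughly as follows. This is only a sketch under simplifying
assumptions: history files are represented by their names in a flat
list (standing in for pg_wal plus the archive), parent timelines come
from a hypothetical in-memory dict instead of parsed file contents,
and both function names are invented here.

```python
import re

# History files are named with the timeline ID as 8 hex digits.
HISTORY_RE = re.compile(r"^([0-9A-F]{8})\.history$")


def newest_timeline(start_tli, file_names):
    """'latest' rule: highest timeline with a history file, never below start_tli."""
    newest = start_tli
    for name in file_names:
        match = HISTORY_RE.match(name)
        if match:
            newest = max(newest, int(match.group(1), 16))
    return newest


def load_expected_tles(target_tli, ancestors_by_tli):
    """Return the timeline list for target_tli, newest first.

    ancestors_by_tli maps a TLI to its ancestors (a stand-in for the
    contents of its history file). With no history available, fall back
    to a generated one-entry list.
    """
    if target_tli in ancestors_by_tli:
        return [target_tli] + ancestors_by_tli[target_tli]
    return [target_tli]


files = ["00000002.history", "000000010000000000000003.partial"]
target = newest_timeline(1, files)                       # checkpoint TLI is 1
with_history = load_expected_tles(target, {2: [1]})      # history file found
without_history = load_expected_tles(1, {})              # fallback list
```

Here `target` ends up as 2, `with_history` as [2, 1], and
`without_history` as the one-entry list [1].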

So I thought that the commit fixed this scenario. Even in this case,
ReadRecord would fail if we did (X), because the checkpoint segment
contains pages for the older timeline, which is not in expectedTLEs.
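
As a rough illustration of why (X) breaks here, consider a toy model
of the timeline-switching segment as a list of per-page timeline IDs.
The numbers are assumptions chosen for the example (checkpoint on
TLI 2, segment starting with pages written on TLI 1), and
first_unexpected_page is a hypothetical helper, not a function in the
PostgreSQL source.

```python
def first_unexpected_page(page_tlis, expected_tles):
    """Return the index of the first page whose TLI is not in
    expected_tles, or None if every page is acceptable."""
    for index, tli in enumerate(page_tlis):
        if tli not in expected_tles:
            return index
    return None


# Pages of the segment in which the switch from TLI 1 to TLI 2 happened.
segment_pages = [1, 1, 1, 2, 2]

# (X): one-entry list holding only the checkpoint's timeline.
one_entry_list = [2]

# List built from the history file of timeline 2.
history_list = [2, 1]

failed_at = first_unexpected_page(segment_pages, one_entry_list)
accepted = first_unexpected_page(segment_pages, history_list)
```

With the one-entry list the very first page is rejected (`failed_at`
is 0), while the history-based list accepts the whole segment
(`accepted` is None).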

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
