Re: Race condition in recovery?

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: robertmhaas(at)gmail(dot)com
Cc: dilipbalaut(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Race condition in recovery?
Date: 2021-05-21 02:21:05
Message-ID: 20210521.112105.27166595366072396.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Thu, 20 May 2021 13:49:10 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in
> On Tue, May 18, 2021 at 1:33 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > Yeah, it will be a fake 1-element list. But just to be clear that
> > 1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and
> > nothing else, do you agree to this? Because we initialize
> > recoveryTargetTLI to this value and we might change it in
> > readRecoveryCommandFile() but for that, we must get the history file,
> > so if we are talking about the case where we don't have the history
> > file then "recoveryTargetTLI" will still be
> > "ControlFile->checkPointCopy.ThisTimeLineID".
>
> I don't think your conclusion is correct. As I understand it, you're
> talking about the state before
> ee994272ca50f70b53074f0febaec97e28f83c4e, because as of now
> readRecoveryCommandFile() no longer exists in master. Before that
> commit, StartupXLOG did this:
>
> recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
> readRecoveryCommandFile();
> expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
>
> Now, readRecoveryCommandFile() can change recoveryTargetTLI. Before
> doing so, it will call existsTimeLineHistory if
> recovery_target_timeline was set to an integer, and findNewestTimeLine
> if it was set to latest. In the first case, existsTimeLineHistory()
> calls RestoreArchivedFile(), but that restores it using a temporary
> name, and KeepFileRestoredFromArchive is not called, so we might have
> the timeline history in RECOVERYHISTORY but that's not the filename
> we're actually going to try to read from inside readTimeLineHistory().
> In the second case, findNewestTimeLine() will call
> existsTimeLineHistory() which results in the same situation. So I
> think when readRecoveryCommandFile() returns expectedTLI can be
> different but the history file can be absent since it was only ever
> restored under a temporary name.

Anyway it seems that the commit tried to fix an issue happens without
using WAL archive.

https://www.postgresql.org/message-id/50E43C57.5050101%40vmware.com

> That leaves one case not covered: If you take a backup with plain
> "pg_basebackup" from a standby, without -X, and the first WAL segment
> contains a timeline switch (ie. you take the backup right after a
> failover), and you try to recover from it without a WAL archive, it
> doesn't work. This is the original issue that started this thread,
> except that I used "-x" in my original test case. The problem here is
> that even though streaming replication will fetch the timeline history
> file when it connects, at the very beginning of recovery, we expect that
> we already have the timeline history file corresponding the initial
> timeline available, either in pg_xlog or the WAL archive. By the time
> streaming replication has connected and fetched the history file, we've
> already initialized expectedTLEs to contain just the latest TLI. To fix
> that, we should delay calling readTimeLineHistoryFile() until streaming
> replication has connected and fetched the file.
> If the first segment read by recovery contains a timeline switch, the first
> pages have older timeline than segment timeline. However, if
> exepectedTLEs contained only the segment timeline, we cannot know if
> we can use the record. In that case the following error is issued.

If expectedTLEs is initialized with the pseudo list,
tliOfPointInHistory always return the recoveryTargetTLI regardless of
the given LSN so the checking about timelines later doesn't work. And
later ReadRecord is surprised to see a page of an unknown timeline.

"unexpected timeline ID 1 in log segment"

So the objective is to initialize expectedTLEs with the right content
of the history file for the recoveryTargetTLI until ReadRecord fetches
the first record. After the fix things are working as the following.

- recoveryTargetTimeLine is initialized with
ControlFile->checkPointCopy.ThisTimeLineID

- readRecoveryCommandFile():

Move recoveryTargetTLI forward to the specified target timline if
the history file for the timeline is found, or in the case of
latest, move it forward up to the maximum timeline among the history
files found in either pg_wal or archive.

!!! Anyway recoveryTargetTLI won't goes back behind the checkpoint
TLI.

- ReadRecord...XLogFileReadAnyTLI

Tries to load the history file for recoveryTargetTLI either from
pg_wal or archive onto local TLE list, if the history file is not
found, use a generateed list with one entry for the
recoveryTargetTLI.

(a) If the specified segment file for any timeline in the TLE list
is found, expectedTLEs is initialized with the local list. No need
to worry about expectedTLEs any longer.

(b) If such a segment is *not* found, expectedTLEs is left
NIL. Usually recoveryTargetTLI is equal to the last checkpoint
TLI.

(c) However, in the case where timeline switches happened in the
segment and the recoveryTargetTLI has been increased, that is, the
history file for the recoveryTargetTLI is found in pg_wal or
archive, that is, the issue raised here, recoveryTargetTLI becomes
the future timline of the checkpoint TLI.

(d) The history file for the recoveryTargetTLI is *not* found but
the segment file is found, expectedTLEs is initialized with the
generated list, which doesn't contain past timelines. In this
case, recoveryTargetTLI has not moved from the initial value of
the checkpoint TLI. If the REDO point is before a timeline switch,
the page causes FATAL in ReadRecord later. However, I think there
cannot be a case where segment file is found before corresponding
history file. (Except for TLI=1, which is no problem.)

- WaitForWALToBecomeAvailable

if we have had no segments for the last checkpoint, initiate
streaming from the REDO point of the last checkpoint. We should have
all history files until receiving segment data.

after sufficient WAL data has been received, the only cases where
expectedTLEs is still NIL are the (b) and (c) above.

In the case of (b) recoveryTargetTLI == checkpoint TLI.

In the case of (c) recoveryTargetTLI > checkpoint TLI. In this case
we expecte that checkpint TLI is in the history of
recoveryTargetTLI. Otherwise recovery failse. This case is similar
to the case (a) but the relationship between recoveryTargetTLI and
the checkpoint TLI is not confirmed yet. ReadRecord barks later if
they are not compatible so there's not a serious problem but might
be better checking the relation ship there. My first proposal
performed mutual check between the two but we need to check only
unidirectionally.

if (readFile < 0)
{
if (!expectedTLEs)
{
expectedTLEs = readTimeLineHistory(receiveTLI);
+ if (!tliOfPointInHistory(receiveTLI, expectedTLEs))
+ ereport(ERROR, "the received timeline %d is not found in the history file for timeline %d");

> > Conclusion:
> > - I think now we agree on the point that initializing expectedTLEs
> > with the recovery target timeline is the right fix.
> > - We still have some differences of opinion about what was the
> > original problem in the base code which was fixed by the commit
> > (ee994272ca50f70b53074f0febaec97e28f83c4e).
>
> I am also still concerned about whether we understand in exactly what
> cases the current logic doesn't work. We seem to more or less agree on
> the fix, but I don't think we really understand precisely what case we
> are fixing.

Does the discussion above make sense?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message houzj.fnst@fujitsu.com 2021-05-21 02:48:05 Fdw batch insert error out when set batch_size > 65535
Previous Message Andrew Dunstan 2021-05-21 01:59:33 Re: multi-install PostgresNode fails with older postgres versions