Re: Race condition in recovery?

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: dilipbalaut(at)gmail(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Race condition in recovery?
Date: 2021-05-07 02:53:02
Message-ID: 20210507.115302.1371322012690532443.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote in
> I think the fix for the problem will be that, after reading/validating
> the checkpoint record, we can free the current value of expectedTLEs
> and reinitialize it based on the recoveryTargetTLI as shown in the
> attached patch?

I'm not sure I understand the issue here. The attached test script
should, as far as I can tell, reproduce the issue described here, but
it didn't for me.

The result of running the attached test script is shown below. The
TLIs in your descriptions cited below have been adjusted to match. The
lines prefixed with NodeN> are server log lines written while the
test script was running.

> 1. Node2 got promoted to new primary, and node 2 picked new TL 2 in
> the middle of the segment 3.

Node2> LOG: selected new timeline ID: 2

> 2. Node3, restarted with new primary info of Node2 and restore command

Node2> node_3 LOG: received replication command: IDENTIFY_SYSTEM

> 3. Node3, found the newest TL2 in validateRecoveryParameters() Because
> the latest TL was requested in recovery.conf (history file restored
> from TL2) and set recoveryTargetTLI to 2 So point to note is
> recoveryTargetTLI is set to 2 but expectedTLEs is not yet set.

This means you specified recovery_target_timeline? Either way,
expectedTLEs is not relevant to the behavior here: even when
recovery_target_timeline is set to latest, findNewestTimeLine doesn't
look at it.
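To illustrate what findNewestTimeLine (timeline.c) actually does when
recovery_target_timeline = 'latest', here is a simplified Python model
(a sketch, not the real C code): it probes for consecutively numbered
history files starting from the recovery TLI and keeps the highest one
found; the available history files are modeled as a plain set.

```python
# Simplified model of PostgreSQL's findNewestTimeLine() (timeline.c).
# The real function probes for "XXXXXXXX.history" files via the archive
# and pg_wal; here their presence is just a Python set of TLI numbers.

def find_newest_timeline(start_tli, history_files_present):
    """Return the highest TLI >= start_tli reachable by probing
    consecutive TLIs; a gap stops the search, as in timeline.c."""
    newest = start_tli
    probe = start_tli + 1
    while probe in history_files_present:
        newest = probe
        probe += 1
    return newest

# With history files for TLIs 2 and 3 archived, recovery starting from
# TLI 1 resolves 'latest' to TLI 3.
print(find_newest_timeline(1, {2, 3}))  # -> 3
# A gap at TLI 2 means TLI 3 is never considered.
print(find_newest_timeline(1, {3}))     # -> 1
```

Note that nothing here consults expectedTLEs; the probe works purely
on which history files exist.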

Node3> LOG: restored log file "00000002.history" from archive

> 4. Node3, entered into the standby mode.

Node3> LOG: entering standby mode

> 5. Node3, tries to read the checkpoint Record, on Node3 still the
> checkpoint TL (ControlFile->checkPointCopy.ThisTimeLineID) is 1.

expectedTLEs is loaded just before fetching the last checkpoint.

ReadCheckpointRecord doesn't consider checkPointCopy.ThisTimeLineID.

The checkpoint carries TLI=1 because the segment file actually opened
was the one for the newest TLI, among those in expectedTLEs, found in
the pg_wal directory. If the TLI=2 segment containing the last
checkpoint had already been archived, the checkpoint record would be
read as TLI=2, and replication would then start at TLI=2 because
archive recovery has already reached that timeline. (Turn on the
optional section in the attached test script to see this behavior.)
This is the expected behavior, since we assume that the segment files
for TLI=n and TLI=n+1 are identical in the TLI=n part.
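The segment-selection logic described above (XLogFileReadAnyTLI in
xlog.c) can be sketched roughly like this, assuming the available
segments are modeled as a set of (TLI, segment number) pairs; this is
an illustration, not the actual source:

```python
# Rough sketch of why the checkpoint can be read from a TLI=1 segment:
# XLogFileReadAnyTLI() walks expectedTLEs from the newest timeline down
# and opens the first segment file it finds.  Since the TLI=n and
# TLI=n+1 segments are byte-identical below the switch point, reading
# the older-TLI copy is safe.

def read_any_tli(segno, expected_tles, segments_available):
    """expected_tles: TLIs ordered newest-first.
    segments_available: set of (tli, segno) pairs present on disk."""
    for tli in expected_tles:
        if (tli, segno) in segments_available:
            return tli   # segment found on this timeline
    return None          # segment not readable on any known timeline

# Only the TLI=1 copy of segment 3 exists: the checkpoint reads as TLI=1.
print(read_any_tli(3, [2, 1], {(1, 3)}))          # -> 1
# Once node2 has archived the TLI=2 copy, that one is preferred.
print(read_any_tli(3, [2, 1], {(1, 3), (2, 3)}))  # -> 2
```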

Anyway the checkpoint that is read is on TLI=1 in this case and
replication starts at TLI=1.

Node3> LOG: Checkpoint record: TLI=1, 0/3014F78

> 6. Node3, tries to get the checkpoint record file using new TL2 from
> the archive which it should get ideally but it may not if the Node2
> haven't yet archived it.

This doesn't happen for me. Instead, node3 runs recovery from the
checkpoint up to the end of the archived WAL; in this case the end
point is 0/3014FF0 on TLI=1.

Node3> LOG: invalid record length at 0/3014FF0: wanted 24, got 0

Then node3 connects to node2 requesting TLI=1, because the history
file (that is, expectedTLEs) says that this LSN belongs to TLI=1.

Node3> LOG: 0/3014FF0 is on TLI 1
Node3> LOG: started streaming WAL from primary at 0/3000000 on timeline 1
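The lookup that makes node3 ask for TLI=1 here, and later TLI=2, is
tliOfPointInHistory() (timeline.c). A simplified Python model of it,
using the switch point from the logs above (assumed values, for
illustration only):

```python
# Model of tliOfPointInHistory(): given the timeline history
# (expectedTLEs), return the timeline a given LSN belongs to.  Each
# entry carries the LSN at which that timeline *ends*; the latest
# timeline has no end (None).

def tli_of_point_in_history(lsn, history):
    """history: list of (tli, end_lsn) ordered oldest to newest."""
    for tli, end in history:
        if end is None or lsn < end:
            return tli
    raise ValueError("LSN beyond known history")

# TLI 1 switches to TLI 2 at 0/3029A68 (per the logs above).
history = [(1, 0x3029A68), (2, None)]
print(tli_of_point_in_history(0x3014FF0, history))  # -> 1
# The switch LSN itself already belongs to the next timeline.
print(tli_of_point_in_history(0x3029A68, history))  # -> 2
```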

After a while node2 reaches the timeline switch and terminates the
replication connection.

Node3> LOG: replication terminated by primary server
Node3> DETAIL: End of WAL reached on timeline 1 at 0/3029A68.

After scanning the archive and pg_wal fails, node3 correctly requests
TLI=2 from node2, because expectedTLEs says that the current LSN
belongs to TLI=2.

Node3> LOG: 0/3029A68 is on TLI 2
Node3> LOG: restarted WAL streaming at 0/3000000 on timeline 2

Finally, the items below don't happen for me, because node3 no longer
needs to go back to the last checkpoint. Perhaps the script fails to
reproduce your issue correctly.

> 7. Node3, tries to stream from primary but using TL1 because
> ControlFile->checkPointCopy.ThisTimeLineID is 1.

As mentioned above, checkPointCopy.ThisTimeLineID on either the
primary or the standby is irrelevant to the timeline the primary
sends: the primary streams whatever timeline the standby requests.

> 8. Node3, get it because walsender of Node2 read it from TL2 and send
> it and Node2 write in the new WAL file but with TL1.

The walsender streams the TLI requested by the walreceiver, then
disconnects at the end of that TLI while notifying node3 of the next
TLI, and node3 re-establishes replication on the new TLI. Looking into
node3's pg_wal, segment 3 for both TLI=1 and TLI=2 is filled with the
correct content.
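That walsender/walreceiver handshake can be summarized with a small
model (hypothetical, not the actual replication protocol code): the
sender serves only the requested timeline, and at its switch point
stops and reports the next TLI for the receiver to reconnect with.

```python
# Sketch of the timeline handshake: the walsender streams the requested
# TLI up to its switch point, then reports the next TLI; the receiver
# reconnects requesting that one.  history maps each TLI to
# (end_lsn_or_None, next_tli_or_None); None means "still the latest".

def stream_timeline(requested_tli, start_lsn, history):
    """Return ((start, end_or_None), next_tli): the LSN range served on
    this TLI and the timeline to request next (None if open-ended)."""
    end, next_tli = history[requested_tli]
    if end is None:
        return (start_lsn, None), None       # current timeline: keep going
    return (start_lsn, end), next_tli        # end of timeline reached

history = {1: (0x3029A68, 2), 2: (None, None)}

# First connection: node3 asks for TLI 1 and is cut off at 0/3029A68,
# being told that TLI 2 comes next.
rng, nxt = stream_timeline(1, 0x3000000, history)
# rng == (0x3000000, 0x3029A68), nxt == 2

# Reconnect on TLI 2, the sender's current timeline: open-ended stream.
rng, nxt = stream_timeline(2, 0x3000000, history)
# rng == (0x3000000, None), nxt is None
```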

So... I don't understand what you are saying the race condition is.

An issue that may be slightly relevant to this case has been raised
[1], but it is about writing the end-of-recovery checkpoint onto the
older timeline.

Could you please fix the test script so that it reproduces your issue,
and/or elaborate a bit more?

The first attached file is a debugging aid that adds logging; the
second is the test script, to be placed in src/test/recovery/t.

1: https://www.postgresql.org/message-id/CAE-ML%2B_EjH_fzfq1F3RJ1%3DXaaNG%3D-Jz-i3JqkNhXiLAsM3z-Ew%40mail.gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
