A failure of standby to follow timeline switch

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: A failure of standby to follow timeline switch
Date: 2020-12-09 08:43:14
Message-ID: 20201209.174314.282492377848029776.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello.

We found a behavioral change (which seems to be a bug) in recovery at
PG13.

The following steps might seem somewhat strange but the replication
code deliberately cope with the case. This is a sequense seen while
operating a HA cluseter using Pacemaker.

- Run initdb to create a primary.
- Set archive_mode=on on the primary.
- Start the primary.

- Create a standby using pg_basebackup from the primary.
- Stop the standby.
- Stop the primary.

- Put stnadby.signal to the primary then start it.
- Promote the primary.

- Start the standby.

Until PG12, the parimary signals end-of-timeline to the standby and
switches to the next timeline. Since PG13, that doesn't happen and
the standby continues to request for the segment of the older
timeline, which no longer exists.

FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000003 has already been removed

It is because WalSndSegmentOpen() can fail to detect a timeline switch
on a historic timeline, due to use of a wrong variable to check
that. It is using state->seg.ws_segno but it seems to be a thinko when
the code around was refactored in 709d003fbd.

The first patch detects the wrong behavior. The second small patch
fixes it.

In the first patch, the test added to 001_stream_rep.pl involves two
copied functions related to server-log investigation from
019_repslot_limit.pl.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Add-a-new-test-to-detect-a-replication-bug.patch text/x-patch 2.5 KB
0002-Fix-a-bug-of-timeline-tracking-while-sending-a-histo.patch text/x-patch 1.0 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2020-12-09 09:07:12 Re: [Proposal] Global temporary tables
Previous Message Michael Paquier 2020-12-09 07:55:04 Re: Occasional tablespace.sql failures in check-world -jnn