PITR promote bug: Checkpointer writes to older timeline

From: Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: kevin(dot)yeap(at)vmware(dot)com, Michael Paquier <michael(at)paquier(dot)xyz>, jyih(at)vmware(dot)com
Subject: PITR promote bug: Checkpointer writes to older timeline
Date: 2021-03-03 01:56:03
Message-ID: CAE-ML+_EjH_fzfq1F3RJ1=XaaNG=-Jz-i3JqkNhXiLAsM3z-Ew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

We came across an issue where the checkpointer writes to the older
timeline while a promotion is ongoing after reaching the recovery point
in a PITR, when there are prepared transactions before the recovery
point. We came across this issue first in REL_12_STABLE and saw that it
also exists in devel.

Setup:
PFA a minimal reproduction script repro.txt.

After running the script, we notice that the checkpointer has written
the end-of-recovery shutdown checkpoint in the previous timeline (tli =
1), i.e. it wrote into the WAL segment 000000010000000000000003 instead
of writing to the WAL segment 000000020000000000000003, causing it to
overwrite WAL records past the recovery point (please see attached diff
output file waldiff.diff) in 000000010000000000000003.

Also, performing a subsequent shutdown on the recovered server may cause
the next shutdown checkpoint record to be written, again, to the
previous timeline, i.e. to 000000010000000000000003. A subsequent server
start will fail as the starup process will be unable to find the
checkpoint in the latest timeline (000000020000000000000003) and we will
get:

...
LOG: invalid record length at 0/3016FB8: wante
d 24, got 0
LOG: invalid primary checkpoint record
PANIC: could not locate a valid checkpoint record
...

RCA:

When there are prepared transactions in an older timeline, in the
checkpointer, a call to CheckPointTwoPhase() and subsequently to
XlogReadTwoPhaseData() and subsequently to read_local_xlog_page() leads
to the following line:

read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);

GetXLogReplayRecPtr() will change ThisTimeLineID to 1, in order to read
the two phase WAL records in the older timeline. This variable will
remain unchanged and the checkpointer ends up writing the checkpoint
record into the older WAL segment (when XLogBeginInsert() is called
within CreateCheckPoint(), the value is still 1). The value is not
synchronized as even if RecoveryInProgress() is called,
xlogctl->SharedRecoveryState is not RECOVERY_STATE_DONE
(SharedRecoveryInProgress = true in older versions) as the startup
process waits for the checkpointer inside RequestCheckpoint() (since
recovery_target_action='promote' involves a non-fast promotion). Thus,
InitXLOGAccess() is not called and the value of ThisTimeLineID is not
updated before the checkpoint record write.

Since 1148e22a82e, GetXLogReplayRecPtr() is called with ThisTimeLineID
instead of a local variable, within read_local_xlog_page().

PFA a small patch that fixes the problem by explicitly calling
InitXLOGAccess() in CheckPointTwoPhase(), after the two phase state data
is read, in order to update ThisTimeLineID to the latest timeline. It is
okay to call InitXLOGAccess() as it is lightweight and would mostly be
a no-op.

Regards,
Soumyadeep, Kevin and Jimmy
VMWare

Attachment Content-Type Size
repro.txt text/plain 1.3 KB
0001-Prevent-checkpointer-from-writing-to-older-timeline.patch text/x-patch 1.1 KB
waldiff.diff text/x-patch 1.4 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2021-03-03 02:01:58 Re: Removing vacuum_cleanup_index_scale_factor
Previous Message Michael Paquier 2021-03-03 01:49:49 Re: Add --tablespace option to reindexdb