Fix primary crash continually with invalid checkpoint after promote

From: Zhao Rui <875941708(at)qq(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Fix primary crash continually with invalid checkpoint after promote
Date: 2022-04-26 07:16:13
Message-ID: tencent_1D53DA1DFA2F5EF11B0D6B9DD24FF8BD4A08@qq.com
Lists: pgsql-bugs pgsql-hackers

A newly promoted primary may be left with an invalid checkpoint.

In CreateRestartPoint, the control file is updated and old WAL files are removed. In some situations, however, the control file is not updated but the old WAL files are still removed. The result is an invalid checkpoint that refers to WAL which no longer exists. The key log messages are "invalid primary checkpoint record" and "could not locate a valid checkpoint record".

The following timeline reproduces the situation described above:

tl1: the standby begins creating a restart point (time- or WAL-triggered).

tl2: the standby is promoted and the control file state is updated to DB_IN_PRODUCTION. The restart point then does not update the control file (xlog.c:9690), but old WAL files are still removed (xlog.c:9719).

tl3: the standby becomes the primary. The primary may crash before the next checkpoint completes (OOM in my case). From then on it crashes continually because of the invalid checkpoint.
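
As a rough illustration of this timeline, the toy Python model below mimics the ordering problem. It is not PostgreSQL code: the class, its fields, and the integer segment numbers are hypothetical, and the DB_IN_ARCHIVE_RECOVERY state name is an assumption that does not appear in this message.

```python
# Toy model of the timeline above; this is not PostgreSQL code.
# All names here (ToyCluster, restart_point, ...) are hypothetical.
from dataclasses import dataclass, field

IN_ARCHIVE_RECOVERY = "DB_IN_ARCHIVE_RECOVERY"  # assumed state name
IN_PRODUCTION = "DB_IN_PRODUCTION"


@dataclass
class ToyCluster:
    state: str = IN_ARCHIVE_RECOVERY
    checkpoint_segment: int = 1  # segment the control file points at
    wal_segments: set = field(default_factory=lambda: {1, 2, 3})

    def restart_point(self, new_segment, promoted_meanwhile=False):
        # tl1: the restart point starts while the node is still a standby.
        if promoted_meanwhile:
            # tl2: promotion changes the state before the control file update.
            self.state = IN_PRODUCTION
        if self.state == IN_ARCHIVE_RECOVERY:
            self.checkpoint_segment = new_segment
        # Reported problem: WAL is removed even if the update was skipped.
        self.wal_segments = {s for s in self.wal_segments if s >= new_segment}

    def can_recover(self):
        # tl3: crash recovery needs the WAL of the recorded checkpoint.
        return self.checkpoint_segment in self.wal_segments


standby = ToyCluster()
standby.restart_point(new_segment=3)
print(standby.can_recover())   # True

promoted = ToyCluster()
promoted.restart_point(new_segment=3, promoted_meanwhile=True)
print(promoted.can_recover())  # False
```

In the second run the recorded checkpoint still points at segment 1, which has already been removed, so every restart attempt in this model fails in the same way.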

The attached patch reproduces this problem using the standard PostgreSQL Perl (TAP) test framework. You can run it with:

./configure --enable-tap-tests; make -j; make -C src/test/recovery/ check PROVE_TESTS=t/027_invalid_checkpoint_after_promote.pl

The attached patch also fixes the problem by ensuring that old WAL files are removed only after the control file has been updated.
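
The ordering rule can be sketched in a few lines of Python. This is only a hypothetical illustration of "remove WAL only after the control file is updated", not the attached patch: the dictionary layout, the function name, and the DB_IN_ARCHIVE_RECOVERY state name are assumptions.

```python
# Hypothetical sketch of the ordering rule; this is not the attached patch.
def restart_point(control, wal_segments, new_segment):
    """Return the WAL segments that remain after a toy restart point."""
    updated = False
    if control["state"] == "DB_IN_ARCHIVE_RECOVERY":  # assumed state name
        control["checkpoint_segment"] = new_segment
        updated = True
    if not updated:
        # Control file untouched: keep all WAL it may still refer to.
        return set(wal_segments)
    # Control file updated: older segments are no longer needed.
    return {s for s in wal_segments if s >= new_segment}


control = {"state": "DB_IN_PRODUCTION", "checkpoint_segment": 1}
remaining = restart_point(control, {1, 2, 3}, new_segment=3)
print(control["checkpoint_segment"] in remaining)  # True
```

With the guard in place, a restart point that loses the race with promotion leaves both the recorded checkpoint and its WAL untouched in this model.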

Attachment Content-Type Size
0001-Fix-primary-crash-continually-with-invalid-checkpoint-after-promote.patch application/octet-stream 5.1 KB
