Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

From: "Imseih (AWS), Sami" <simseih(at)amazon(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "alvherre(at)alvh(dot)no-ip(dot)org" <alvherre(at)alvh(dot)no-ip(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Date: 2022-05-27 19:01:37
Message-ID: E37D2989-CFB6-4302-A871-A007DC71DA25@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> So.. I'd like to hear exactly what you did as the testing.

> When standby mode is requested, if crash recovery fails with immature
> contrecord, I think we shouldn't continue recovery. But I'm not sure
> we need to explictly reject that case. Further study is needed..

Here is more details about my findings and testing.

Even with commit 9d92582abf918215d27659d45a4c9e78bda50aff
we still see the issue with post promotion checkpoint
resulting in "request to flush past end of generated WAL;"

i.e.

2022-05-25 00:35:38 UTC::@:[371]:LOG: checkpoint starting: immediate force wait wal time
2022-05-25 00:35:38 UTC::@:[371]:LOG: request to flush past end of generated WAL; request 10/B1FA3D88, currpos 7/A8000060
2022-05-25 00:35:38 UTC::@:[371]:PANIC: xlog flush request 10/B1FA3D88 is not satisfied --- flushed only to 7/A8000060
2022-05-25 00:35:38 UTC:172.31.26.238(38610):administrator(at)postgres:[23433]:ERROR: current transaction is aborted, commands ignored until end of transaction block

The intent of commit 9d92582abf918215d27659d45a4c9e78bda50aff
was to make sure the standby skips the overwrite contrecord.

However, we still see "missingContrecPtr" is being set
on the standby before promotion and after the instance is
promoted, the missingContrecPtr is written to WAL
and the subsequent flush throws a "PANIC: xlog flush request"

To Reproduce using TAP tests;

1) apply the attached patch 0001-Patch_to_repro.patch to the head branch.
This patch adds logging when an OverWriteContRecord
is created and comments out the invalidation code
added in commit 9d92582abf918215d27659d45a4c9e78bda50aff

2) Run a tap test under "test/recovery"
make check PROVE_TESTS='t/026_overwrite_contrecord.pl'

3) What is observed in the logfiles for both the standby
and primary instance in the tap test is
that an overwrite contrecord is created on the primary,
which is correct, but also on the standby after promotion,
which is incorrect. This is incorrect as the contrecord

simseih(at)88665a22795f recovery % cat tmp_check/log/*prim* | grep 'creating\|promo'
2022-05-27 13:17:50.843 CDT [98429] LOG: creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000

simseih(at)88665a22795f recovery % cat tmp_check/log/*stan* | grep 'creating\|promo'
2022-05-27 13:17:51.361 CDT [98421] LOG: received promote request
2022-05-27 13:17:51.394 CDT [98421] LOG: creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000
simseih(at)88665a22795f recovery %

What we found:

1. missingContrecPtr is set when
StandbyMode is false, therefore
only a writer should set this value
and a record is then sent downstream.

But a standby going through crash
recovery will always have StandbyMode = false,
causing the missingContrecPtr to be incorrectly
set.

2. If StandbyModeRequested is checked instead,
we ensure that a standby will not set a
missingContrecPtr.

3. After applying the patch below, the tap test succeeded

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5ee90b6..a727aaf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2977,7 +2977,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
* we'll write a record to indicate to downstream WAL readers that
* that portion is to be ignored.
*/
- if (!StandbyMode &&
+ if (!StandbyModeRequested &&
!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
{
abortedRecPtr = xlogreader->abortedRecPtr;

So, it might be that the best fix is not the commit in 9d92582abf918215d27659d45a4c9e78bda50aff
but to check StandbyModeRequested = false before setting
missingContrecPtr.

Thank you
---
Sami Imseih
Amazon Web Services

Attachment Content-Type Size
0001-Patch_to_repro.patch application/octet-stream 1.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Cary Huang 2022-05-27 19:08:51 Re: Allowing REINDEX to have an optional name
Previous Message Andres Freund 2022-05-27 18:59:13 Re: suboverflowed subtransactions concurrency performance optimize