Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: simseih(at)amazon(dot)com, alvherre(at)alvh(dot)no-ip(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Date: 2022-06-28 07:09:26
Message-ID: 20220628.160926.1646442754540928448.horikyota.ntt@gmail.com
Lists: pgsql-hackers

I'd like to look into the WAL segments related to the failure.

Mmm... With the patch, xlogreader->abortedRecPtr is valid if and only
if the most recently failed record read was an aborted contrec. If
recovery ends there, the first inserted record is an "aborted contrec"
record. I still think the only way an aborted contrecord can be
followed by a non-"aborted contrec" record is for recovery to somehow
fetch two consecutive WAL segments that are inconsistent at the
boundary.

I found the reason why the TAP test doesn't respond to the first
proposed patch (shown below).

- if (!StandbyMode &&
+ if (!StandbyModeRequested &&
!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))

The cause was that I had disabled standby mode in the test to make it
simpler and more reliable, but the change takes effect only while
standby mode is on. The first attached patch detects the same error
(in a somewhat, maybe unstable, way) and responds to the fix above, and
also responds to aborted_contrec_reset_3.patch.
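
To make the effect of the one-line change concrete, here is a minimal
standalone sketch, not the actual server code. The typedef and macros are
stand-ins for the PostgreSQL definitions, and condition_before(),
condition_after() and replay_line_a() are hypothetical names introduced
only for illustration. It evaluates both forms of the condition for the
state where standby mode is requested but not currently active
(SbyMode=0, SbyModeReq=1, as in the [A] log line further down), assuming
the parenthesized abort value in that line is the reader's abortedRecPtr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL definitions, so the sketch compiles alone. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define XLogRecPtrIsInvalid(r) ((r) == InvalidXLogRecPtr)

/* Hypothetical helper: the condition as it reads before the change. */
bool
condition_before(bool StandbyMode, XLogRecPtr readerAbortedRecPtr)
{
    return !StandbyMode && !XLogRecPtrIsInvalid(readerAbortedRecPtr);
}

/* Hypothetical helper: the condition as it reads after the change. */
bool
condition_after(bool StandbyModeRequested, XLogRecPtr readerAbortedRecPtr)
{
    return !StandbyModeRequested && !XLogRecPtrIsInvalid(readerAbortedRecPtr);
}

/* Replays the [A] log line: SbyMode=0, SbyModeReq=1, reader abort=0/1FFD338. */
void
replay_line_a(void)
{
    XLogRecPtr readerAbortedRecPtr = (XLogRecPtr) 0x1FFD338;

    /* Before the change the condition holds for this state. */
    assert(condition_before(false, readerAbortedRecPtr));
    /* After the change it does not, because standby mode was requested. */
    assert(!condition_after(true, readerAbortedRecPtr));
}
```

With standby mode neither requested nor active, both forms agree; they
differ only in the "requested but not active" state shown here.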

So, aborted_contrec_reset_3 looks closer to the issue than before.

Would you mind trying the second attached patch to obtain a detailed
log in your testing environment? With the patch, the modified TAP test
yields log lines like the ones below.

2022-06-28 15:49:20.661 JST [165472] LOG: ### [A] @0/1FFD338: abort=(0/1FFD338)0/0, miss=(0/2000000)0/0, SbyMode=0, SbyModeReq=1
...
2022-06-28 15:49:20.681 JST [165472] LOG: ### [F] @0/2094610: abort=(0/0)0/1FFD338, miss=(0/0)0/2000000, SbyMode=1, SbyModeReq=1
...
2022-06-28 15:49:20.767 JST [165472] LOG: ### [S] @0/2094610: abort=(0/0)0/1FFD338, miss=(0/0)0/2000000, SbyMode=0, SbyModeReq=1
...
2022-06-28 15:49:20.777 JST [165470] PANIC: xlog flush request 0/2094610 is not satisfied --- flushed only to 0/2000088

In this example, abortedRecPtr is set at the first line and recovery
continues to 0/2094610, but abortedRecPtr is never reset, and the
server then PANICs. ([A] means an aborted contrec failure. [F] and [S]
are failed and successful reads, respectively.)
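
For readers comparing the positions in these log lines by hand, the
following is a small sketch of how the textual "hi/lo" hexadecimal LSN
form can be turned into a 64-bit value and compared. parse_lsn(),
flush_request_satisfied() and check_panic_line() are hypothetical helpers
written for this note, not functions from the PostgreSQL source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: parse "hi/lo" (both hexadecimal) into a 64-bit LSN. */
bool
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;
    char         trailing;

    /* Exactly two fields must match; any trailing character is rejected. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
        return false;

    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/* Hypothetical helper: a flush request is met once WAL is flushed up to it. */
bool
flush_request_satisfied(uint64_t request, uint64_t flushed)
{
    return flushed >= request;
}

/* Checks the two positions quoted in the PANIC line above. */
void
check_panic_line(void)
{
    uint64_t request;
    uint64_t flushed;

    assert(parse_lsn("0/2094610", &request));
    assert(parse_lsn("0/2000088", &flushed));
    /* The requested position lies beyond what has been flushed. */
    assert(!flush_request_satisfied(request, flushed));
}
```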

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
detect_aborted_contrec_panic_2.diff text/x-patch 1.8 KB
abortcont_additional_log.diff text/x-patch 959 bytes
