Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

From: "Imseih (AWS), Sami" <simseih(at)amazon(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "alvherre(at)alvh(dot)no-ip(dot)org" <alvherre(at)alvh(dot)no-ip(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Date: 2022-02-25 21:28:18
Message-ID: EF51690F-5006-47C4-9473-9DFAFC0B1778@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> After some investigation, I finally concluded that we should reset
> abortedRecPtr and missingContrecPtr at processing
> XLOG_OVERWRITE_CONTRECORD.

> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -1953,6 +1953,11 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
timestamptz_to_str(xlrec.overwrite_time))));

> + /* We have safely skipped the aborted record */
> + abortedRecPtr = InvalidXLogRecPtr;
> + missingContrecPtr = InvalidXLogRecPtr;
> +
> /* Verifying the record should only happen once */
> record->overwrittenRecPtr = InvalidXLogRecPtr;
> }

> The last check in the test against "resetting aborted record" is no
> longer useful since it is already checked by
> 026_verwrite_contrecord.pl.

> regards.

+1 for this. Resetting abortedRecPtr and missingContrecPtr after the broken record is skipped in is cleaner. I also think it would make sense to move the " successfully skipped missing contrecord" to after we invalidate the aborted and missing record pointers, like below.

+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1948,15 +1948,15 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
LSN_FORMAT_ARGS(record->overwrittenRecPtr));

+ /* We have safely skipped the aborted record */
+ abortedRecPtr = InvalidXLogRecPtr;
+ missingContrecPtr = InvalidXLogRecPtr;
+
ereport(LOG,
(errmsg("successfully skipped missing contrecord at %X/%X, overwritten at %s",
LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
timestamptz_to_str(xlrec.overwrite_time))));

Also, instead of a new test, the current 026_overwrite_contrecord.pl test can include a promotion test at the end.

Attaching a new patch for review.

--
Sami Imseih
Amazon Web Services

Attachment Content-Type Size
v4-0001-Fix-missing-continuation-record-after-standby-promot.patch application/octet-stream 2.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2022-02-25 21:31:23 Re: why do hash index builds use smgrextend() for new splitpoint pages
Previous Message Andres Freund 2022-02-25 21:26:49 Re: [PATCH] Expose port->authn_id to extensions and triggers