Re: Infinite loop in XLogPageRead() on standby

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: cyberdemn(at)gmail(dot)com
Cc: michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org, thomas(dot)munro(at)gmail(dot)com
Subject: Re: Infinite loop in XLogPageRead() on standby
Date: 2024-03-06 08:57:44
Message-ID: 20240306.175744.2104302179933900645.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Tue, 5 Mar 2024 09:36:44 +0100, Alexander Kukushkin <cyberdemn(at)gmail(dot)com> wrote in
> Please find attached the patch fixing the problem and the updated TAP test
> that addresses Nit.

Record-level retries happen when the upper layer detects errors. In my
previous mail, I cited code that is intended to prevent this at
segment boundaries. However, the resulting code applies to all page
boundaries, since we judged that the difference doesn't significantly
affect the outcome.

> * Check the page header immediately, so that we can retry immediately if
> * it's not valid. This may seem unnecessary, because ReadPageInternal()
> * validates the page header anyway, and would propagate the failure up to

So, the following (tentative) change should also work.

xlogrecovery.c:
@@ -3460,8 +3490,10 @@ retry:
      * responsible for the validation.
      */
     if (StandbyMode &&
+        targetPagePtr % 0x100000 == 0 &&
         !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
     {
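
For illustration only (this is not part of the patch), here is a
standalone sketch of what the added condition selects. It does not use
any PostgreSQL headers: WalPtr is a stand-in for XLogRecPtr, and the
helper names and the 8 kB page size are assumptions. 0x100000 is 1 MiB;
the sketch takes that constant as given from the change above and does
not claim it matches any particular wal_segment_size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL byte position). */
typedef uint64_t WalPtr;

/* Assumed values, for illustration only. */
#define SKETCH_PAGE_SIZE      UINT64_C(8192)      /* assumed WAL page size */
#define SKETCH_BOUNDARY_SIZE  UINT64_C(0x100000)  /* constant from the change above */

/* True when ptr is the first byte of a WAL page. */
static bool
at_page_boundary(WalPtr ptr)
{
    return ptr % SKETCH_PAGE_SIZE == 0;
}

/* True when ptr is a multiple of 0x100000, i.e. what the added condition tests. */
static bool
at_checked_boundary(WalPtr ptr)
{
    return ptr % SKETCH_BOUNDARY_SIZE == 0;
}

/* Number of page starts per 0x100000-byte span. */
static uint64_t
pages_per_span(void)
{
    return SKETCH_BOUNDARY_SIZE / SKETCH_PAGE_SIZE;
}
```

Under these assumptions, the early header check would run for only one
page start out of every pages_per_span() of them, and every other page
would be left to the validation done in the upper layer.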

I managed to reproduce precisely the same situation you described,
using your script with some modifications and some core tweaks, and
with the change above I saw that the behavior was fixed. However, for
reasons unclear to me, it shows another issue, and I am running out of
time and need more caffeine. I'll continue investigating this
tomorrow.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
