Re: Infinite loop in XLogPageRead() on standby

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: cyberdemn(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, thomas(dot)munro(at)gmail(dot)com
Subject: Re: Infinite loop in XLogPageRead() on standby
Date: 2024-03-01 04:16:37
Message-ID: 20240301.131637.563635299005880728.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Fri, 01 Mar 2024 12:37:55 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> Anyway, our current policy here is to avoid record-rereads beyond
> source switches. However, fixing this seems to require that source
> switches cause record rereads unless some additional information is
> available to know if replay LSN needs to back up.

It seems to me that the error messages are related to commit 0668719801.

XLogPageRead:
> * Check the page header immediately, so that we can retry immediately if
> * it's not valid. This may seem unnecessary, because ReadPageInternal()
> * validates the page header anyway, and would propagate the failure up to
> * ReadRecord(), which would retry. However, there's a corner case with
> * continuation records, if a record is split across two pages such that
> * we would need to read the two pages from different sources. For
...
> if (StandbyMode &&
> !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
> {
> /*
> * Emit this error right now then retry this page immediately. Use
> * errmsg_internal() because the message was already translated.
> */
> if (xlogreader->errormsg_buf[0])
> ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
> (errmsg_internal("%s", xlogreader->errormsg_buf)));

This code intends to prevent a page header error from causing a record
reread, when a record is required to be read from multiple sources. We
could restrict this to only fire at segment boundaries. At segment
boundaries, we won't let LSN back up by using XLP_FIRST_IS_CONTRECORD.

Having thought up to this point, I now believe that we should
completely prevent LSN from going back in any case. One drawback is
that the fix cannot be back-patched.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2024-03-01 04:22:55 Re: Synchronizing slots from primary to standby
Previous Message John Naylor 2024-03-01 04:15:05 Re: [PoC] Improve dead tuple storage for lazy vacuum