Re: BUG #17928: Standby fails to decode WAL on termination of primary

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: thomas(dot)munro(at)gmail(dot)com, exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17928: Standby fails to decode WAL on termination of primary
Date: 2023-07-11 06:54:14
Message-ID: 20230711.155414.939136339819647605.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

At Mon, 10 Jul 2023 13:00:12 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
> > On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> > > 2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
> > > 2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1
> >
> > Thanks Alexander. Looking into this. I think it is probably
> > something like: recycled standby pages are not zeroed (something we
> > already needed to do something about[1]), and when we read a recycled
> > garbage size (like your "xxxx") at the end of a page at an offset
> > where we don't have a full record header on one page, we skip the
> > ValidXLogRecordHeader() call (and always did), but the check in
> > allocate_recordbuf() which previously handled that "gracefully" (well,
> > it would try to allocate up to 1GB bogusly, but it wouldn't try to
> > allocate more than that and ereport) is a bit too late. I probably
> > need to add an earlier not-too-big validation. Thinking.
>
> I agree about an earlier not-too-big validation. Like the attached? I
> haven't tested it with Alexander's recipe or pondered it thoroughly.

I like the patch for its code clean up, but I'm afraid that it removes
the existing record length check when reading continued
pages. However, I'm unsure about merely adding a check for too-long
records, due to the potential for requesting an excessively large
amount of memory, even if it will be released shortly.

> > [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
>
> Regarding [1], is it still worth zeroing recycled pages on standbys and/or
> reading the whole header before allocating xl_tot_len? (Are there benefits
> other than avoiding a 1G backend allocation or 4G frontend allocation, or is
> that benefit worth the cycles?)

I believe reading the whole header is the most sensible approach as it
can prevent unnecessary memory requests. Another potential solution
(or hack) for this specific case is to let XLogWalRcvFlush write a
finalizing ((uint32)0) when dying is true. This stabilizes the
behavior to "invalid record length.. got 0" when running the TAP test.

regards.

Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
write_finalize_tot_len.diff text/x-patch 571 bytes

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Richard Veselý 2023-07-11 09:31:06 RE: BUG #18016: REINDEX TABLE failure
Previous Message suyu.cmj 2023-07-11 02:35:15 Got FATAL in lock_twophase_recover() during recovery