Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, ethmertz(at)amazon(dot)com, nathandbossart(at)gmail(dot)com, pgsql(at)j-davis(dot)com, sawada(dot)mshk(at)gmail(dot)com
Subject: Re: Incorrect handling of OOM in WAL replay leading to data loss
Date: 2023-09-26 06:48:07
Message-ID: ZRJ-p1dLUY0uoChc@paquier.xyz

On Thu, Aug 10, 2023 at 02:59:07PM +0900, Michael Paquier wrote:
> My apologies if I sounded unclear here. It seems to me that we should
> wrap up the patch on [1] first and get it backpatched. At least that
> makes for fewer conflicts when 0001 gets merged into HEAD once we are
> able to set a proper error code. (Was looking at it, actually.)

Now that Thomas Munro has addressed the original problem with
bae868caf22, making it possible to trust xl_tot_len, I am coming back
to this thread.

First, attached is a rebased patch set:
- 0001 introduces the new error infrastructure for xlogreader.c with an
error code, so that callers can tell the difference between an OOM and
an invalid record (see the sketch below).
- 0002 tweaks the startup process. Once again, I've taken the approach
of making the startup process behave like a standby during crash
recovery: each time an OOM is detected, we loop and retry.
- 0003 emulates an OOM failure; it can be used with the attached script
to check that we don't stop recovery too early.
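
To make this concrete, here is a minimal, self-contained C sketch of
the idea behind 0001.  All names in it (WALReadErrorCode,
wal_read_record, and friends) are made up for illustration; they are
not the API the patch introduces.

/*
 * Standalone sketch: the WAL reader reports an error code alongside
 * the error message, so that callers can tell an out-of-memory
 * failure apart from a genuinely invalid record.
 */
#include <stddef.h>
#include <stdio.h>

typedef enum WALReadErrorCode
{
	WALREAD_NO_ERROR,			/* record decoded successfully */
	WALREAD_OOM,				/* allocation failed; record may be valid */
	WALREAD_INVALID_RECORD		/* validation failed: treat as end of WAL */
} WALReadErrorCode;

/* Pretend reader: fails with OOM when the record exceeds the budget. */
static WALReadErrorCode
wal_read_record(size_t record_len, size_t mem_budget)
{
	if (record_len > mem_budget)
		return WALREAD_OOM;
	/* a real reader would validate the header and CRC here */
	return WALREAD_NO_ERROR;
}

int
main(void)
{
	switch (wal_read_record(64 * 1024 * 1024, 1024 * 1024))
	{
		case WALREAD_NO_ERROR:
			printf("record applied\n");
			break;
		case WALREAD_OOM:
			/* Not an end-of-WAL condition: retry or fail hard, but
			 * never conclude that recovery is finished. */
			printf("out of memory, not treating this as end of WAL\n");
			break;
		case WALREAD_INVALID_RECORD:
			printf("invalid record, legitimate end of recovery\n");
			break;
	}
	return 0;
}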

>>> I guess that it depends on how much responsiveness one may want.
>>> Forcing a failure on OOM is at least something that users would be
>>> immediately able to act on when we don't run a standby but just
>>> recover from a crash, while a standby would do what it is designed to
>>> do, aka continue to replay what it can see. One issue with the
>>> wait-and-continue is that a standby may loop continuously on OOM,
>>> which could also be bad if there's a replication slot retaining WAL
>>> on the primary. Perhaps it is just OK to keep doing that for a
>>> standby. At least this makes the discussion easier for the sake of
>>> this thread: just consider the case of crash recovery when we don't
>>> have a standby.
>>
>> Yeah, I'm with you on focusing on crash recovery cases; that's what I
>> intended. Sorry for any confusion.
>
> Okay, so we're on the same page here, keeping standbys as they are and
> doing something for the crash recovery case.

For the crash recovery case, one argument that stood out in my mind is
that causing a hard failure has the disadvantage of forcing users to
redo WAL replay from the last redo position, which may be far away even
if the checkpointer now runs during crash recovery. What I am proposing
on this thread has the merit of avoiding that. Anyway, let's discuss
this more before settling the point for the crash recovery case.
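
And here is a rough standalone sketch of the retry behavior I have in
mind for 0002.  The stubbed reader and the one-second sleep are
arbitrary assumptions to keep the example runnable; they are not what
the patch does.

#include <stdio.h>
#include <unistd.h>

typedef enum { READ_OK, READ_OOM, READ_INVALID } ReadResult;

/*
 * Stub reader: pretend the second read hits OOM once, and the fourth
 * read finds an invalid record (the legitimate end of WAL).
 */
static ReadResult
try_read_next_record(void)
{
	static int	call = 0;
	static const ReadResult script[] =
		{READ_OK, READ_OOM, READ_OK, READ_INVALID};

	return script[call++ % 4];
}

static void
apply_record(void)
{
	printf("record applied\n");
}

int
main(void)
{
	for (;;)
	{
		ReadResult	res = try_read_next_record();

		if (res == READ_OK)
		{
			apply_record();
			continue;
		}
		if (res == READ_OOM)
		{
			/* Transient failure: wait and retry the same record. */
			fprintf(stderr, "out of memory while reading WAL, retrying\n");
			sleep(1);
			continue;
		}
		/* READ_INVALID: treat as the end of WAL and stop replay. */
		printf("end of WAL reached, recovery finished\n");
		break;
	}
	return 0;
}

The important property is that an OOM never takes the READ_INVALID
path, so recovery cannot end early just because memory was tight.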

By the way, nothing that I am proposing here can be backpatched
because of the infrastructure changes required in xlogreader.c, so I am
going to create a second thread with something that could be
backpatched (yeah, likely FATALs on OOM to stop recovery from doing
something bad).
--
Michael

Attachment Content-Type Size
v4-0001-Add-infrastructure-to-report-error-codes-in-WAL-r.patch text/x-diff 41.9 KB
v4-0002-Make-WAL-replay-more-robust-on-OOM-failures.patch text/x-diff 4.5 KB
v4-0003-Tweak-to-force-OOM-behavior-when-replaying-record.patch text/x-diff 2.4 KB
xlogreader_oom.bash text/plain 942 bytes
