Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, ethmertz(at)amazon(dot)com, nathandbossart(at)gmail(dot)com, pgsql(at)j-davis(dot)com, sawada(dot)mshk(at)gmail(dot)com
Subject: Re: Incorrect handling of OOM in WAL replay leading to data loss
Date: 2023-08-10 04:33:48
Message-ID: ZNRorM9fZL1BYtDy@paquier.xyz

On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
> At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
>> At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
>>>> While it's a bug in any case, we encountered a case where an
>>>> excessively large xl_tot_len actually came from a corrupted
>>>> record. [1]
>>>
>>> Right, I remember this one. I think that Thomas was pretty much right
>>> that this could be caused because of a lack of zeroing in the WAL
>>> pages.
>>
>> We have treated every kind of broken data as end-of-recovery, such as
>> an incorrect rm_id or prev link, including an excessively large record
>> length due to corruption. This patch is going to change the behavior
>> only for the last one. If you think there can't be non-zero broken
>> data, we should prevent recovery from proceeding past any non-zero
>> incorrect data. That would be quite a big change in our recovery
>> policy.

Well, see the report that led to this thread: we may lose data and
finish with corrupted pages. I was planning to reply to the other
thread [1] and to Noah's patch anyway, because we have to fix the
detection of OOM vs corrupted records in the allocation path. The
infra introduced by 0001, and something like 0002 that allows the
startup process to take a different path depending on the type of
error, are still needed to avoid ending recovery too early, though.
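
Roughly, and only as a sketch with made-up names rather than what 0002
literally does (it assumes the usual backend includes), the dispatch in
the startup process could look like this:

/*
 * Hypothetical sketch only: the error-code enum, its values and the
 * function below are stand-ins for what 0001/0002 are described to
 * provide; none of this is an existing API.  The point is that the
 * startup process can distinguish an allocation failure from data
 * that fails validation, instead of treating both as end of WAL.
 */
typedef enum XLogReaderErrorCode
{
    XLOG_READER_NO_ERROR,       /* record read and validated fine */
    XLOG_READER_OOM,            /* allocation failure while decoding */
    XLOG_READER_INVALID_DATA    /* record failed a validation check */
} XLogReaderErrorCode;

static void
ReactToReadFailure(XLogReaderErrorCode code)
{
    switch (code)
    {
        case XLOG_READER_OOM:
            /*
             * Fail hard instead of declaring end-of-recovery and
             * silently dropping the WAL that follows the record we
             * could not allocate.
             */
            ereport(FATAL,
                    (errmsg("out of memory while reading WAL record")));
            break;
        case XLOG_READER_INVALID_DATA:
            /* traditional behavior: treat as the end of available WAL */
            break;
        default:
            break;
    }
}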

>>> There are a few options on the table, only doable once the WAL reader
>>> provider the error state to the startup process:
>>> 1) Retry a few times and FATAL.
>>> 2) Just FATAL immediately and don't wait.
>>> 3) Retry and hope for the best that the host calms down.
>>
>> 4) Wrap up recovery then continue to normal operation.
>>
>> This is the traditional behavior for corrupt WAL data.

Yeah, we've been doing a pretty bad job in classifying the errors that
can happen during WAL replay in crash recovery, because we assume that
all the WAL still in pg_wal/ is correct. That's not an easy problem,
because the record CRC covers all the contents of the record, and we
have to look at the record header before we can verify it. Another
idea may be to have an extra CRC for the header itself, one that could
be checked in isolation as one of the checks in
XLogReaderValidatePageHeader().
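
As a strawman (the struct and the xl_hdr_crc field are purely
hypothetical, since this would be a WAL format change, and it assumes
the usual backend and xlog headers), such a check could look like:

/*
 * Purely hypothetical: XLogRecord has no header-only CRC today.  The
 * sketch only shows how a header could be validated in isolation,
 * before trusting xl_tot_len for an allocation, using the existing
 * pg_crc32c macros.
 */
typedef struct XLogRecordHdrCrc
{
    uint32          xl_tot_len;     /* total length of the record */
    TransactionId   xl_xid;         /* xact id */
    XLogRecPtr      xl_prev;        /* pointer to previous record */
    uint8           xl_info;        /* flag bits */
    RmgrId          xl_rmid;        /* resource manager */
    pg_crc32c       xl_hdr_crc;     /* hypothetical CRC of the fields above */
    pg_crc32c       xl_crc;         /* CRC of the whole record, as today */
} XLogRecordHdrCrc;

static bool
HeaderCrcIsValid(const XLogRecordHdrCrc *hdr)
{
    pg_crc32c   crc;

    /* compute the CRC over everything that precedes xl_hdr_crc */
    INIT_CRC32C(crc);
    COMP_CRC32C(crc, (const char *) hdr,
                offsetof(XLogRecordHdrCrc, xl_hdr_crc));
    FIN_CRC32C(crc);

    return EQ_CRC32C(crc, hdr->xl_hdr_crc);
}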

>>> I have not seen this issue being much of a problem in the field, so
>>> perhaps option 2 with the structure of 0002 and a FATAL when we catch
>>> XLOG_READER_OOM in the switch would be enough. At least that's enough
>>> for the cases we've seen. I'll think a bit more about it, as well.
>>>
>>> Yeah, agreed. That's orthogonal to the issue reported by Ethan,
>>> unfortunately, where he was able to trigger the issue of this thread
>>> by resizing a host after producing a record larger than what the
>>> host could afford after the resizing :/
>>
>> I'm not entirely certain, but if you were to ask me which is more
>> probable during recovery - encountering a correct record that's too
>> lengthy for the server to buffer or stumbling upon a corrupt byte
>> sequence - I'd bet on the latter.

I don't really believe in chance when it comes to computer science;
facts, and correct detection of those facts, are better :)

> ... of course this refers to crash recovery. For replication, we
> should keep retrying the current record until the operator commands
> promotion.

Are you referring to a retry when there is a standby.signal? I am a
bit confused by this sentence, because we could do crash recovery,
then switch to archive recovery. So, I guess that you mean that on
OOM we should retry retrieving WAL from the local pg_wal/ even while
we are in the crash recovery phase, *before* switching to archive
recovery and a different source, right? I think that this argument
can go two ways, because it could be more helpful for some to see a
FATAL while we are still in crash recovery, even if there is a
standby.signal. It does not seem to me that we have a clear
definition of what to do in which case, either. Now we just fail and
hope for the best when doing crash recovery.

>> I'm not sure how often users encounter corrupt WAL data, but I believe
>> they should have the option to terminate recovery and then switch to
>> normal operation.
>>
>> What if we introduced an option to increase the timeline whenever
>> recovery hits a data error? If that option is disabled, the server
>> stops when recovery detects incorrect data, except in the case of an
>> OOM; an OOM causes the record to be retried.

I guess that it depends on how much responsiveness one may want.
Forcing a failure on OOM is at least something that users would be
immediately able to act on when we don't run a standby but just
recover from a crash, while a standby would do what it is designed to
do, aka continue to replay what it can see. One issue with the
wait-and-continue approach is that a standby may loop continuously on
OOM, which could also be bad if there's a replication slot retaining
WAL on the primary. Perhaps it's just OK to keep doing that for a
standby. At least this makes the discussion easier for the sake of
this thread: just consider the case of crash recovery when we don't
have a standby.
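
As a sketch with made-up names, not an actual proposal for where this
would hook into the startup process, that behavior would be something
like:

/*
 * Sketch only: on OOM, fail hard during plain crash recovery so the
 * operator can free memory and retry, while a standby keeps waiting
 * and retrying the same record, accepting that this can loop and
 * retain WAL on the primary through a slot.
 */
static void
ReactToOom(bool standby_mode)
{
    if (!standby_mode)
    {
        /*
         * Crash recovery: ending recovery here would silently discard
         * everything after the record we failed to allocate, so stop.
         */
        ereport(FATAL,
                (errmsg("out of memory while replaying WAL record")));
    }

    /* Standby: pause, then retry reading the same record. */
    pg_usleep(1000000L);        /* 1s */
}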
--
Michael
