Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, ethmertz(at)amazon(dot)com, nathandbossart(at)gmail(dot)com, pgsql(at)j-davis(dot)com, sawada(dot)mshk(at)gmail(dot)com
Subject: Re: Incorrect handling of OOM in WAL replay leading to data loss
Date: 2023-08-10 05:59:07
Message-ID: ZNR8q8nevVK6IIgR@paquier.xyz
Lists: pgsql-hackers

On Thu, Aug 10, 2023 at 02:47:51PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 10 Aug 2023 13:33:48 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
>> On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
>> > At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
>> >> We have treated every kind of broken data as end-of-recovery, such as
>> >> an incorrect rm_id or prev link, including an excessively large record
>> >> length due to corruption. This patch changes the behavior only for the
>> >> last one. If you think there cannot be non-zero broken data, we should
>> >> stop recovery on any non-zero incorrect data. That would be quite a
>> >> big change in our recovery policy.
>>
>> Well, see the report that led to this thread: we may lose data and
>> finish with corrupted pages. I was planning to reply to the other
>> thread [1] and Noah's patch anyway, because we have to fix the
>> detection of OOM vs corrupted records in the allocation path.
>
> Does this mean we will address the distinction between an OOM and a
> corrupt total record length later on? If that's the case, should we
> modify that behavior right now?

My apologies if I sounded unclear here. It seems to me that we should
wrap up the patch on [1] first and get it backpatched. At least that
makes for fewer conflicts when 0001 gets merged into HEAD once we are
able to set a proper error code. (I was looking at it, actually.)

>>>> 4) Wrap up recovery then continue to normal operation.
>>>>
>>>> This is the traditional behavior for corrupt WAL data.
>>
>> Yeah, we've been doing a pretty bad job in classifying the errors that
>> can happen during WAL replay in crash recovery, because we assume that
>> all the WAL still in pg_wal/ is correct. That's not an easy problem,
>
> I'm not quite sure what "correct" means here. I believe xlogreader
> runs various checks since the data may be incorrect. Given that WAL can
> break for various reasons, during crash recovery we continue as long
> as the incoming WAL records remain consistently valid. The problem
> raised here is that we can't clearly distinguish an OOM from a
> corrupted total record length field. One reason is that we check the
> data after a part of it has been loaded, but in such cases we can't
> load all the bytes of the record into memory.

Yep. Correct means that we end recovery in a consistent state, not
earlier than we should.
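
To make the distinction concrete, here is a minimal standalone sketch of
the kind of logic under discussion (not the actual xlogreader code; the
sanity bound, the result enum and the function name are made up for
illustration): an implausible xl_tot_len is reported as corruption,
while a failed allocation of a plausible length is reported as OOM so
the caller can retry instead of ending recovery.

#include <stdint.h>
#include <stdlib.h>

typedef enum
{
    READ_OK,            /* record buffer allocated */
    READ_BAD_RECORD,    /* treat as corruption / end of WAL */
    READ_OOM            /* transient failure: retry, do not end recovery */
} read_result;

/* Arbitrary sanity bound for this sketch, not a real xlogreader limit. */
#define XLR_MAX_SANE_LEN    (1024 * 1024 * 1024)

static read_result
allocate_record_buffer(uint32_t xl_tot_len, char **buf)
{
    /* An implausible total length is corruption, not a memory problem. */
    if (xl_tot_len == 0 || xl_tot_len > XLR_MAX_SANE_LEN)
        return READ_BAD_RECORD;

    *buf = malloc(xl_tot_len);
    if (*buf == NULL)
        return READ_OOM;    /* report distinctly so the caller can retry */

    return READ_OK;
}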

>> because the record CRC works for all the contents of the record, and
>> we look at the record header before that. Another idea may be to have
>> an extra CRC covering only the header itself, which could be used in
>> isolation as one of the checks in XLogReaderValidatePageHeader().
>
> Sounds reasonable. By using CRC to protect the header part and
> allocating a fixed-length buffer for it, I believe we're adopting a
> standard approach and can identify OOM and other kinds of header errors
> with a good degree of certainty.

Not something that I'd like to cover in this patch set, though. This
is a problem on its own.
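
For illustration only, a rough sketch of what a header-only CRC check
could look like. XLogRecord has no separate header CRC today (xl_crc
covers the whole record), so the struct and its xl_hdr_crc field below
are purely hypothetical; only the pg_crc32c macros come from the real
port/pg_crc32c.h.

#include "postgres.h"

#include "port/pg_crc32c.h"

/* Hypothetical header layout with its own CRC; not the real XLogRecord. */
typedef struct HypotheticalRecordHeader
{
    uint32      xl_tot_len;     /* total record length */
    uint32      xl_xid;         /* xact id */
    uint64      xl_prev;        /* pointer to previous record */
    uint8       xl_info;        /* flag bits */
    uint8       xl_rmid;        /* resource manager */
    pg_crc32c   xl_hdr_crc;     /* CRC of the fields above (hypothetical) */
} HypotheticalRecordHeader;

static bool
HeaderCRCIsValid(const HypotheticalRecordHeader *hdr)
{
    pg_crc32c   crc;

    /*
     * Compute the CRC over everything preceding xl_hdr_crc.  Real code
     * would have to zero or exclude any padding bytes in the header.
     */
    INIT_CRC32C(crc);
    COMP_CRC32C(crc, hdr, offsetof(HypotheticalRecordHeader, xl_hdr_crc));
    FIN_CRC32C(crc);

    return EQ_CRC32C(crc, hdr->xl_hdr_crc);
}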

>>>> I'm not entirely certain, but if you were to ask me which is more
>>>> probable during recovery - encountering a correct record that's too
>>>> lengthy for the server to buffer or stumbling upon a corrupt byte
>>>> sequence - I'd bet on the latter.
>>
>> I don't really believe in chance when it comes to computer science;
>> facts, and a correct detection of those facts, are better :)
>
> Even now, it seems like we're balancing the risk of potential data
> loss against the potential inability to start the server. What I meant
> is that, if I had to choose, I'd lean slightly towards avoiding the
> latter, in other words, keeping the current behavior.

On OOM, this means data loss and silent corruption. A failure at least
has the merit of telling someone that something is wrong and that
they'd better look at it rather than hope for the best.

>>> ... of course this refers to crash recovery. For replication, we
>>> should keep retrying the current record until the operator commands
>>> promotion.
>>
>> Are you referring to a retry if there is a standby.signal? I am a
>> bit confused by this sentence, because we could do a crash recovery,
>> then switch to archive recovery. So, I guess that you mean that on
>> OOM we should retry retrieving WAL from the local pg_wal/ even in the
>> case where we are in the crash recovery phase, *before* switching to
>
> Apologies for the confusion. What I was thinking is that an OOM is
> more likely to occur in replication downstreams than during server
> startup. I also felt that, in the latter case, such a challenging
> environment probably wouldn't let the server enter stable normal
> operation.

It depends on what the user does with the host running the cluster.
Both could be impacted.

>>>> I'm not sure how often users encounter corrupt WAL data, but I believe
>>>> they should have the option to terminate recovery and then switch to
>>>> normal operation.
>>>>
>>>> What if we introduced an option to increase the timeline whenever
>>>> recovery hits a data error? If that option is disabled, the server
>>>> stops when recovery detects incorrect data, except in the case of an
>>>> OOM. An OOM causes a record retry.
>>
>> I guess that it depends on how much responsiveness one may want.
>> Forcing a failure on OOM is at least something that users would be
>> immediately able to act on when we don't run a standby but just
>> recover from a crash, while a standby would do what it is designed to
>> do, aka continue to replay what it can see. One issue with the
>> wait-and-continue is that a standby may loop continuously on OOM,
>> which could also be bad if there's a replication slot retaining WAL on
>> the primary. Perhaps it's just OK to keep doing that for a
>> standby. At least this makes the discussion easier for the sake of
>> this thread: just consider the case of crash recovery when we don't
>> have a standby.
>
> Yeah, I'm with you on focusing on crash recovery cases; that's what I
> intended. Sorry for any confusion.

Okay, so we're on the same page here, keeping standbys as they are and
doing something for the crash recovery case.
--
Michael
