Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, ethmertz(at)amazon(dot)com, nathandbossart(at)gmail(dot)com, pgsql(at)j-davis(dot)com, sawada(dot)mshk(at)gmail(dot)com
Subject: Re: Incorrect handling of OOM in WAL replay leading to data loss
Date: 2023-08-10 01:15:40
Message-ID: 20230810.101540.739766630476144825.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
> > > While it's a kind of bug in total, we encountered a case where an
> > > excessively large xl_tot_len actually came from a corrupted
> > > record. [1]
> >
> > Right, I remember this one. I think that Thomas was pretty much right
> > that this could be caused because of a lack of zeroing in the WAL
> > pages.
>
> We have treated every kind of broken data as end-of-recovery, such as
> an incorrect rm_id or prev link, including an excessively large record
> length due to corruption. This patch changes the behavior only for
> the last one. If you think there can't be non-zero broken data, we
> should stop recovery from proceeding after any non-zero incorrect
> data. That would be quite a big change in our recovery policy.
>
> > There are a few options on the table, only doable once the WAL reader
> > provides the error state to the startup process:
> > 1) Retry a few times and FATAL.
> > 2) Just FATAL immediately and don't wait.
> > 3) Retry and hope for the best that the host calms down.
>
> 4) Wrap up recovery then continue to normal operation.
>
> This is the traditional behavior for corrupt WAL data.
>
> > I have not seen this issue being much of an issue in the field, so
> > perhaps option 2 with the structure of 0002 and a FATAL when we catch
> > XLOG_READER_OOM in the switch would be enough. At least that's enough
> > for the cases we've seen. I'll think a bit more about it, as well.
> >
> > Yeah, agreed. That's orthogonal to the issue reported by Ethan,
> > unfortunately, where he was able to trigger the issue of this thread
> > by manipulating the sizing of a host after producing a record larger
> > than what the host could afford after the resizing :/
>
> I'm not entirely certain, but if you were to ask me which is more
> probable during recovery - encountering a correct record that's too
> lengthy for the server to buffer or stumbling upon a corrupt byte
> sequence - I'd bet on the latter.

... of course this refers to crash recovery. For replication, we
should keep retrying the current record until the operator commands
promotion.

> I'm not sure how often users encounter corrupt WAL data, but I believe
> they should have the option to terminate recovery and then switch to
> normal operation.
>
> What if we introduced an option to increase the timeline whenever
> recovery hits a data error? If that option is disabled, the server
> stops when recovery detects incorrect data, except in the case of an
> OOM. An OOM causes a record retry.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
