Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: ethmertz(at)amazon(dot)com, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Incorrect handling of OOM in WAL replay leading to data loss
Date: 2023-08-01 03:43:21
Message-ID: ZMh/
Lists: pgsql-hackers

Hi all,

A colleague, Ethan Mertz (in CC), has discovered that we do not
correctly handle WAL records that fail because of an out-of-memory
(OOM) condition when allocating their required space. In Ethan's case,
we hit the problem after an allocation failure in
XLogReadRecordAlloc():
"out of memory while trying to decode a record of length"

As far as I can see, PerformWalRecovery() uses LOG as the elevel for
its private callback in the xlogreader when going through
ReadRecord(), so the failure gets reported, but recovery treats the
failure as the end of WAL and abruptly ends, leading to some data
loss.

In crash recovery, any records after the OOM would not be replayed.
At a quick glance, it seems to me that this can also impact standbys,
where recovery could stop earlier than it should once a consistent
point has been reached.

Attached is a patch that can be applied on HEAD to inject an error;
then just run the attached script xlogreader_oom.bash, or something
similar, to see the failure in the logs:
LOG: redo starts at 0/1913CD0
LOG: out of memory while trying to decode a record of length 57
LOG: redo done at 0/1917358 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

It also looks like recovery_prefetch may mitigate the issue a bit if
we do a read in non-blocking mode, but that is not a strong guarantee
either, especially if the host is under memory pressure.

A patch is registered in the commit fest to improve the error
detection handling, but as far as I can see it fails to handle the OOM
case and changes ReadRecord() to use a WARNING in the redo loop:

Off the top of my head, any solution I can think of needs to add more
information to XLogReaderState, where we would track the type of error
that happened, close to errormsg_buf, which is where these error
messages are stored. Unfortunately, none of that can be backpatched.


Attachment Content-Type Size
xlogreader_oom.bash text/plain 942 bytes
0001-Tweak-to-force-OOM-behavior-when-replaying-records.patch text/x-diff 1.5 KB

