Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: ethmertz(at)amazon(dot)com, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Incorrect handling of OOM in WAL replay leading to data loss
Date: 2023-08-01 03:43:21
Message-ID: ZMh/
Lists: pgsql-hackers

Hi all,

A colleague, Ethan Mertz (in CC), has discovered that we do not
correctly handle WAL records that fail because of an out-of-memory
(OOM) condition when allocating their required space. In Ethan's case,
we hit the problem after an allocation failure in
XLogReadRecordAlloc():
"out of memory while trying to decode a record of length"

As far as I can see, PerformWalRecovery() uses LOG as the elevel for
its private callback in the xlogreader when going through
ReadRecord(), so the failure gets reported, but recovery treats the
failure as the end of WAL and abruptly ends, leading to some data
loss.

In crash recovery, any records after the OOM would not be replayed.
At a quick glance, it seems to me that this can also impact standbys,
where recovery could stop earlier than it should once a consistent
point has been reached.

Attached is a patch that can be applied on HEAD to inject an error;
then just run the attached script xlogreader_oom.bash, or something
similar, to see the failure in the logs:
LOG: redo starts at 0/1913CD0
LOG: out of memory while trying to decode a record of length 57
LOG: redo done at 0/1917358 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

It also looks like recovery_prefetch may mitigate the issue a bit if
we do a read in non-blocking mode, but that is not a strong guarantee
either, especially if the host is under memory pressure.

A patch is registered in the commit fest to improve the error
detection handling, but as far as I can see it fails to handle the OOM
case and changes ReadRecord() to use a WARNING in the redo loop:

Off the top of my head, any solution I can think of needs to add more
information to XLogReaderState, where we would track the type of error
that happened, close to errormsg_buf, which is where these error
messages are stored. Unfortunately, none of that can be backpatched.


Attachment Content-Type Size
xlogreader_oom.bash text/plain 942 bytes
0001-Tweak-to-force-OOM-behavior-when-replaying-records.patch text/x-diff 1.5 KB

