Re: Incorrect handling of OOM in WAL replay leading to data loss

From:	Aleksander Alekseev <aleksander(at)timescale(dot)com>
To:	Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, ethmertz(at)amazon(dot)com, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject:	Re: Incorrect handling of OOM in WAL replay leading to data loss
Date:	2023-08-01 13:14:54
Message-ID:	CAJ7c6TN44p8zkVsj-MbGO8-dDg13Ci3UJEAOd-an4uY27rpPwg@mail.gmail.com
Lists:	pgsql-hackers

Hi,

> As far as I can see, PerformWalRecovery() uses LOG as elevel
> [...]
> On top of my mind, any solution I can think of needs to add more
> information to XLogReaderState, where we'd either track the type of
> error that happened close to errormsg_buf which is where these errors
> are tracked, but any of that cannot be backpatched, unfortunately.

Probably I'm missing something, but if memory allocation is required
during WAL replay and it fails, wouldn't it be a better solution to
log the error and terminate the DBMS immediately?
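
To make the "log and terminate" idea concrete, here is a minimal standalone sketch. It is not PostgreSQL code: alloc_record_buffer(), replay_action, failing_alloc() and replay_one_record() are hypothetical names invented for illustration, and the allocator is passed in as a function pointer only so that an allocation failure can be simulated.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical allocator signature; malloc() matches it. */
typedef void *(*alloc_fn)(size_t);

/* Hypothetical result type: what the replay loop should do next. */
typedef enum
{
    REPLAY_CONTINUE,
    REPLAY_FATAL_OOM
} replay_action;

/*
 * Allocate a buffer for one record. On failure, log the error and tell
 * the caller to stop, rather than reporting a generic read failure.
 */
replay_action
alloc_record_buffer(size_t len, alloc_fn alloc, void **buf)
{
    *buf = alloc(len);
    if (*buf == NULL)
    {
        fprintf(stderr,
                "FATAL: out of memory allocating %zu bytes during WAL replay\n",
                len);
        return REPLAY_FATAL_OOM;
    }
    return REPLAY_CONTINUE;
}

/* Allocator that always fails, to simulate an out-of-memory condition. */
void *
failing_alloc(size_t len)
{
    (void) len;
    return NULL;
}

/* Caller side: terminate immediately on OOM instead of carrying on. */
void
replay_one_record(size_t len, alloc_fn alloc)
{
    void *buf;

    if (alloc_record_buffer(len, alloc, &buf) == REPLAY_FATAL_OOM)
        exit(EXIT_FAILURE);

    /* ... decode and apply the record here ... */
    free(buf);
}
```

With this shape the failure is reported where it happens, and the DBA can fix the memory problem and start the recovery again.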

Clearly Postgres has no control over the amount of memory
available. It's up to the DBA to resolve the problem and start the
recovery again. If this happens on a replica, it indicates a
misconfiguration of the system and/or lack of the corresponding
configuration options.

Maybe a certain amount of memory should be reserved for WAL replay
and perhaps other needs. In the latter case the system should account
for memory overcommitment by the OS - cases when a successful malloc()
doesn't necessarily allocate the required amount of *physical* memory,
as happens on Linux.
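
As a rough illustration of what accounting for overcommitment could look like, the sketch below allocates a region and then writes one byte to every page, so the kernel has to back it with real pages up front. reserve_touched() is a hypothetical helper, not an existing Postgres function, and it assumes a POSIX system for sysconf(). Two caveats: under overcommit the failure may show up as the OOM killer during the touching loop rather than as a NULL return, and touched pages can still be swapped out unless they are also locked.

```c
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate `bytes` and touch every page.
 * Returns NULL if malloc() itself fails.
 */
void *
reserve_touched(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile char *p;
    size_t off;

    if (page <= 0)
        page = 4096;            /* fallback assumption */

    p = malloc(bytes);
    if (p == NULL)
        return NULL;

    /*
     * Write a non-zero byte to each page through a volatile pointer so
     * the compiler cannot drop the stores or fold them into calloc().
     */
    for (off = 0; off < bytes; off += (size_t) page)
        p[off] = 1;

    return (void *) p;
}
```

Whether reserving memory this way is a good idea at all is a separate question; the point is only that a successful malloc() by itself says little about physical memory.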

--
Best regards,
Aleksander Alekseev

In response to

Incorrect handling of OOM in WAL replay leading to data loss at 2023-08-01 03:43:21 from Michael Paquier

Responses

Re: Incorrect handling of OOM in WAL replay leading to data loss at 2023-08-01 23:39:54 from Jeff Davis
