[BUG] non archived WAL removed during production crash recovery

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>
Subject: [BUG] non archived WAL removed during production crash recovery
Date: 2020-03-31 15:22:29
Message-ID: 20200331172229.40ee00dc@firost
Lists: pgsql-bugs pgsql-hackers

Hello,

A colleague of mine reported an unexpected behavior.

When a production cluster is in crash recovery, e.g. after killing a backend,
the WALs ready to be archived are removed before being archived.

See the attached reproduction script "non-arch-wal-on-recovery.bash".

This behavior was introduced in commit 78ea8b5daab9237fd42d7a8a836c1c451765499f.
The function XLogArchiveCheckDone() wrongly treats a production cluster in
crash recovery as a standby without archive_mode=always, so the check
concludes that the WAL can be removed safely.

    bool inRecovery = RecoveryInProgress();

    /*
     * The file is always deletable if archive_mode is "off". On standbys
     * archiving is disabled if archive_mode is "on", and enabled with
     * "always". On a primary, archiving is enabled if archive_mode is "on"
     * or "always".
     */
    if (!((XLogArchivingActive() && !inRecovery) ||
          (XLogArchivingAlways() && inRecovery)))
        return true;

Please find attached a patch that fixes this issue by using the following test
instead:

    if (!((XLogArchivingActive() && !StandbyModeRequested) ||
          (XLogArchivingAlways() && inRecovery)))
        return true;
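
To make the difference between the two tests concrete, here is a minimal
standalone sketch. It is not PostgreSQL source: the ClusterState struct, the
ArchiveMode enum and the helper names are hypothetical stand-ins for
archive_mode, RecoveryInProgress() and StandbyModeRequested, and it assumes
XLogArchivingActive() means "archive_mode is not off" and
XLogArchivingAlways() means "archive_mode is always".

```c
#include <stdbool.h>

/* Hypothetical model of the settings involved; not PostgreSQL types. */
typedef enum { ARCHIVE_MODE_OFF, ARCHIVE_MODE_ON, ARCHIVE_MODE_ALWAYS } ArchiveMode;

typedef struct
{
    ArchiveMode archive_mode;
    bool        in_recovery;        /* stands in for RecoveryInProgress() */
    bool        standby_requested;  /* stands in for StandbyModeRequested */
} ClusterState;

/* Assumed meaning of XLogArchivingActive(): archive_mode is "on" or "always". */
static bool archiving_active(const ClusterState *s)
{
    return s->archive_mode != ARCHIVE_MODE_OFF;
}

/* Assumed meaning of XLogArchivingAlways(): archive_mode is "always". */
static bool archiving_always(const ClusterState *s)
{
    return s->archive_mode == ARCHIVE_MODE_ALWAYS;
}

/* Test as it stands today: true means "deletable without archive check". */
static bool deletable_current(const ClusterState *s)
{
    return !((archiving_active(s) && !s->in_recovery) ||
             (archiving_always(s) && s->in_recovery));
}

/* Test with the proposed change to the first part of the condition. */
static bool deletable_patched(const ClusterState *s)
{
    return !((archiving_active(s) && !s->standby_requested) ||
             (archiving_always(s) && s->in_recovery));
}
```

Under these assumptions, a primary with archive_mode=on in crash recovery
(in_recovery true, standby_requested false) is reported as deletable by the
current test but not by the patched one, while a standby with archive_mode=on
gets the same answer from both.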

I'm not sure whether we should rely on StandbyModeRequested for the second part
of the test as well, though. What was the point of relying on
RecoveryInProgress() to get the recovery status from shared memory?

Regards,

Attachment Content-Type Size
non-arch-wal-on-recovery.bash application/octet-stream 943 bytes
0001-Fix-WAL-retention-during-production-crash-recovery.patch text/x-patch 1.1 KB
