Re: [BUG] non archived WAL removed during production crash recovery

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [BUG] non archived WAL removed during production crash recovery
Date: 2020-04-01 08:27:22
Message-ID: d0ef12c5-7e36-b9b3-c8ea-2e98d69d1317@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On 2020/04/01 0:22, Jehan-Guillaume de Rorthais wrote:
> Hello,
>
> A colleague of mine reported an expected behavior.
>
> On production cluster is in crash recovery, eg. after killing a backend, the
> WALs ready to be archived are removed before being archived.
>
> See in attachment the reproduction script "non-arch-wal-on-recovery.bash".
>
> This behavior has been introduced in 78ea8b5daab9237fd42d7a8a836c1c451765499f.
> Function XLogArchiveCheckDone() badly consider the in crashed recovery
> production cluster as a standby without archive_mode=always. So the check
> conclude the WAL can be removed safely.

Thanks for the report! Yeah, this seems a bug.

> bool inRecovery = RecoveryInProgress();
>
> /*
> * The file is always deletable if archive_mode is "off". On standbys
> * archiving is disabled if archive_mode is "on", and enabled with
> * "always". On a primary, archiving is enabled if archive_mode is "on"
> * or "always".
> */
> if (!((XLogArchivingActive() && !inRecovery) ||
> (XLogArchivingAlways() && inRecovery)))
> return true;
>
> Please find in attachment a patch that fix this issue using the following test
> instead:
>
> if (!((XLogArchivingActive() && !StandbyModeRequested) ||
> (XLogArchivingAlways() && inRecovery)))
> return true;
>
> I'm not sure if we should rely on StandbyModeRequested for the second part of
> the test as well thought. What was the point to rely on RecoveryInProgress() to
> get the recovery status from shared mem?

Since StandbyModeRequested is the startup process-local variable,
it basically cannot be used in XLogArchiveCheckDone() that other process
may call. So another approach would be necessary... One straight idea is to
add new shmem flag indicating whether we are in standby mode or not.
Another is to make the startup process remove .ready file if necessary.

If it's not easy to fix the issue, we might need to just revert the commit
that introduced the issue at first.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2020-04-01 08:51:56 BUG #16331: segfault in checkpointer with full disk
Previous Message Michael Paquier 2020-04-01 07:08:42 Re: BUG #16325: Assert failure on partitioning by int for a text value with a collation

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2020-04-01 08:37:23 Re: shared-memory based stats collector
Previous Message Michael Paquier 2020-04-01 08:18:28 Re: pgbench - add \aset to store results of a combined query