Re: [BUG] non archived WAL removed during production crash recovery

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [BUG] non archived WAL removed during production crash recovery
Date: 2020-04-01 16:17:35
Message-ID: 20200401181735.11100908@firost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Wed, 1 Apr 2020 17:27:22 +0900
Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
[...]
> > bool inRecovery = RecoveryInProgress();
> >
> > /*
> > * The file is always deletable if archive_mode is "off". On standbys
> > * archiving is disabled if archive_mode is "on", and enabled with
> > * "always". On a primary, archiving is enabled if archive_mode is "on"
> > * or "always".
> > */
> > if (!((XLogArchivingActive() && !inRecovery) ||
> > (XLogArchivingAlways() && inRecovery)))
> > return true;
> >
> > Please find in attachment a patch that fix this issue using the following
> > test instead:
> >
> > if (!((XLogArchivingActive() && !StandbyModeRequested) ||
> > (XLogArchivingAlways() && inRecovery)))
> > return true;
> >
> > I'm not sure if we should rely on StandbyModeRequested for the second part
> > of the test as well thought. What was the point to rely on
> > RecoveryInProgress() to get the recovery status from shared mem?
>
> Since StandbyModeRequested is the startup process-local variable,
> it basically cannot be used in XLogArchiveCheckDone() that other process
> may call.

Ok, you answered my wondering about using recovery status from shared mem. This
was obvious. Thanks for your help!

I was wondering if we could use "ControlFile->state != DB_IN_CRASH_RECOVERY".
It seems fine during crash recovery as the control file is updated before the
checkpoint, but it doesn't feel right for other code paths where the control
file might not be up-to-date on filesystem, right ?

> So another approach would be necessary... One straight idea is to
> add new shmem flag indicating whether we are in standby mode or not.

I was thinking about setting XLogCtlData->SharedRecoveryInProgress as an enum
using:

enum RecoveryState
{
NOT_IN_RECOVERY = 0
IN_CRASH_RECOVERY,
IN_ARCHIVE_RECOVERY
}

Please, find in attachment a patch implementing this.

Plus, I added a second commit to add one test in regard with this bug.

> Another is to make the startup process remove .ready file if necessary.

I'm not sure to understand this one.

Regards,

Attachment Content-Type Size
0001-v2-Fix-WAL-retention-during-production-crash-recovery.patch text/x-patch 5.5 KB
0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch text/x-patch 1.5 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Julien Rouhaud 2020-04-01 17:25:24 Re: BUG #16331: segfault in checkpointer with full disk
Previous Message Pavel Stehule 2020-04-01 16:12:06 Re: BUG #16332: The xmltable function returns 'comment()' as unusable xml appended together and having no xml tags

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-04-01 16:29:51 Re: proposal \gcsv
Previous Message Andres Freund 2020-04-01 16:13:54 Re: wraparound dangers in snapshot too old