Re: [BUG] non archived WAL removed during production crash recovery

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org, michael(at)paquier(dot)xyz
Subject: Re: [BUG] non archived WAL removed during production crash recovery
Date: 2020-04-03 16:26:25
Message-ID: 20200403182625.0fccc6fd@firost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:

> On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:
>> On Thu, 2 Apr 2020 19:38:59 +0900
>> Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
[...]
>>> That is, WAL files with .ready files are removed when either
>>> archive_mode!=always in standby mode or archive_mode=off.
>>
>> sounds fine to me.

To some extends, it appears to me this sentence was relatively close to my
previous patch, as far as you exclude IN_CRASH_RECOVERY from the shortcut:

XLogArchiveCheckDone(const char *xlog)
{
[...]
if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
(inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) ||
(inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
return true;

Which means that only .done cleanup would occurs during CRASH_RECOVERY
and .ready files might be created if no .done exists. No matter the futur status
of the cluster: primary or standby. Normal shortcut will apply during first
checkpoint after the crash recovery step.

This should handle the case where a backup without backup_label (taken from a
snapshot or after a shutdown with immediate) is restored to build a standby.

Please, find in attachment a third version of my patch
"0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch".

"0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left
untouched. But I'm considering adding some more tests relative to this
discussion.

> BTW, now I'm thinking that the flag in shmem should be updated when
> the startup process sets StandbyModeRequested to true at the beginning
> of the recovery. That is,
>
> - Add something like SharedStandbyModeRequested into XLogCtl. This field
> should be initialized with false;
> - Set XLogCtl->SharedStandbyModeRequested to true when the startup
> process detects the standby.signal file and sets the local variable
> StandbyModeRequested to true.
> - Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
> to know whether the server is in standby mode or not.
>
> Thought?

I try to avoid a new flag in memory with the proposal in attachment of this
email. It seems to me various combinaisons of booleans with subtle differences
around the same subject makes it a bit trappy and complicated to understand.

If my proposal is rejected, I'll be happy to volunteer to add
XLogCtl->SharedStandbyModeRequested though. It seems like a simple enough fix
as well.

Regards,

Attachment Content-Type Size
0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch text/x-patch 5.3 KB
0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch text/x-patch 1.5 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Fujii Masao 2020-04-03 17:49:50 Re: [BUG] non archived WAL removed during production crash recovery
Previous Message Dmitry Dolgov 2020-04-03 15:29:10 Re: BUG #16325: Assert failure on partitioning by int for a text value with a collation

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2020-04-03 16:32:01 Re: zombie connections
Previous Message Ibrar Ahmed 2020-04-03 16:04:34 Re: VACUUM memory management