Re: [BUG] non archived WAL removed during production crash recovery

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: jgdr(at)dalibo(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org, michael(at)paquier(dot)xyz
Subject: Re: [BUG] non archived WAL removed during production crash recovery
Date: 2020-04-02 10:38:59
Message-ID: ca964b3a-61a0-902e-c7b3-3abbc01a921f@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On 2020/04/02 16:23, Kyotaro Horiguchi wrote:
> At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote in
>> On 2020/04/02 13:07, Kyotaro Horiguchi wrote:
>>> Sorry, it was quite ambiguous.
>>> At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
>>> <horikyota(dot)ntt(at)gmail(dot)com> wrote in
>>>> At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
>>>> <jgdr(at)dalibo(dot)com> wrote in
>>>>> Please, find in attachment a patch implementing this.
>>>>
>>>> The patch partially reintroduces the issue the patch have
>>>> fixed. Specifically a standby running a crash recovery wrongly marks a
>>>> WAL file as ".ready" if it is extant in pg_wal without accompanied by
>>>> .ready file.
>>> The patch partially reintroduces the issue the commit 78ea8b5daa have
>>> fixed. Specifically a standby running a crash recovery wrongly marks a
>>> WAL file as ".ready" if it is extant in pg_wal without accompanied by
>>> .ready file.
>>
>> On second thought, I think that we should discuss what the desirable
>> behavior is before the implentation. Especially what's unclear to me
>
> Agreed.
>
>> is whether to remove such WAL files in archive recovery case with
>> archive_mode=on. Those WAL files would be required when recovering
>> from the backup taken before that archive recovery happens.
>> So it seems unsafe to remove them in that case.
>
> I'm not sure I'm getting the intention correctly, but I think it
> responsibility of the operator to provide a complete set of archived
> WAL files for a backup. Could you elaborate what operation steps are
> you assuming of?

Please imagine the case where you need to do archive recovery
from the database snapshot taken while there are many WAL files
with .ready files. Those WAL files have not been archived yet.
In this case, ISTM those WAL files should not be removed until
they are archived, when archive_mode = on.

>> Therefore, IMO that the patch should change the code so that
>> no unarchived WAL files are removed not only in crash recovery
>> but also archive recovery. Thought?
>
> Agreed if "an unarchived WAL" means "a WAL file that is marked .ready"
> and it should be archived immediately. My previous mail is written
> based on the same thought.

Ok, so our *current* consensus seems the followings. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
crash recovery, archive recoery and standby mode.

- If archive_mode=on, WAL files with .ready files are removed only in
standby mode. In crash recovery and archive recovery cases, they keep
remaining and would be archived after recovery finishes (i.e., during
normal processing).

- If archive_mode=always, in crash recovery, archive recovery and
standby mode, WAL files with .ready files are archived if WAL archiver
is running.

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

> In a very narrow window, if server crashed or killed after a segment
> is finished but before marking the file as .ready, the file doesn't
> have .ready but should be archived. If we need to get rid of such a
> window, it would help to mark a WAL file as ".busy" at creation time.
>
>> Of course, this change would lead to the issue that the past
>> unarchived
>> WAL files keep remaining in the case of warm-standby using archive
>> recovery. But this issue looks unavoidable. If users want to avoid
>> that,
>> archive_mode should be set to always.
>>
>> Also I'm a bit wondering if it's really safe to remove such unarchived
>> WAL files even in the standby case with archive_mode=on. I would need
>> more time to think that.
>>
>>>> Perhaps checking '.ready' before the checking for archive-mode would
>>>> be sufficient.
>>>>
>>>>> Plus, I added a second commit to add one test in regard with this bug.
>>>>>
>>>>>> Another is to make the startup process remove .ready file if
>>>>>> necessary.
>>>>>
>>>>> I'm not sure to understand this one.
>>
>> I was thinking to make the startup process remove such unarchived WAL
>> files
>> if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
>> is true.
>
> As mentioned above, I don't understand the point of preserving WAL
> files that are either marked as .ready or not marked at all on a
> standby with archive_mode=on.

Maybe yes. But I'm not confident about that there is no such case.
Anyway I'm fine to fix the bug based on the above consensus at first.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Julien Rouhaud 2020-04-02 11:04:36 Re: BUG #16334: We recently upgraded PG version from 9.5 to 10.10 and system performance is not so good
Previous Message Tejaswini GC 2020-04-02 10:22:56 Re: BUG #16334: We recently upgraded PG version from 9.5 to 10.10 and system performance is not so good

Browse pgsql-hackers by date

  From Date Subject
Next Message Kashif Zeeshan 2020-04-02 11:29:47 Re: WIP/PoC for parallel backup
Previous Message Rajkumar Raghuwanshi 2020-04-02 09:57:52 Re: WIP/PoC for parallel backup