Re: [BUG] non archived WAL removed during production crash recovery

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: [BUG] non archived WAL removed during production crash recovery
Date: 2020-04-18 09:26:11
Message-ID: 20200418092611.GD350229@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Fri, Apr 17, 2020 at 03:33:04PM +0200, Jehan-Guillaume de Rorthais wrote:
> On Fri, 17 Apr 2020 15:50:43 +0900 Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>> Not sure that it is something that matters for this thread though, so
>> if necessary I think that it could be discussed separately.
>
> OK. However, unless I'm wrong, what I am describing as a desired behavior
> is the current behavior of XLogArchiveCheckDone. So, we might want to decide if
> v8 should return false during crash recovery no matter the archive_mode setup,
> or if we keep the current behavior. I vote for keeping it this way.

I would rather avoid that now, as we don't check explicitely for crash
recovery in this code path. And for the purpose of this patch it is
fine to stick with the extra check on a standby with
(RECOVERY_STATE_ARCHIVE && archive_mode = always).

> Maybe we could use something more common for all plateform? Eg.:
>
> archive_command='this command does not exist'
>
> At least, we would have the same error everywhere, as far as it could matter...

Yeah. We could try to do with "false" as command anyway, and see what
the buildfarm thinks. As the test is skipped on Windows, I would
assume that it does not matter much anyway. Let's see what others
think about this piece. I don't have plans to touch again this patch
until likely the middle of next week.

>> Thanks, I have worked more on the test, refactoring pieces related to
>> the segment names, adjusting some comments and fixing some of the
>> logic. Note that you introduced something incorrect at the creation
>> of $standby2 as you have been updating postgresql.conf.auto for
>> $standby1.
>
> erf, last minute quick edit with lack of review on my side :(

No problem. It happens.

>> I have noticed an extra issue while looking at the backend pieces
>> today: at the beginning of the REDO loop we forgot one place where
>> SharedRecoveryState *has* to be updated to a correct state (around
>> the comment "Update pg_control to show that we are..." in xlog.c) as
>> the startup process may decide to switch the control file state to
>> DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to
>> update the new shared flag at this early stage. It did not matter
>> before because SharedRecoveryInProgress would be only "true" for both,
>> but that's not the case anymore as we need to make the difference
>> between crash recovery and archive recovery in the new flag. There is
>> no actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH
>> as the initial shared memory state is RECOVERY_STATE_CRASH, but
>> updating the flag makes the code more consistent IMCRASHO so I updated it
>> anyway in the attached.
>
> Grmbl...I had this logic the other way around: init with
> RECOVERY_STATE_RECOVERY and set to CRASH in this exact if/then/else block... I
> removed it in v4 when setting XLogCtl->SharedRecoveryState to RECOVERY or CRASH
> based on ControlFile->state.
>
> Sorry, I forgot it after discussing the init value in v5 :(

Indeed. The extra initialization was part of v4, and got removed as
of v5. Still, it seems to me that this part was not complete without
updating the shared memory field correctly at the beginning of the
REDO processing as the last version of the patch does.
--
Michael

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2020-04-18 09:28:56 Re: BUG #16369: Segmentation Faults and Data Corruption with Generated Columns
Previous Message David Rowley 2020-04-18 02:13:17 Re: BUG #16369: Segmentation Faults and Data Corruption with Generated Columns

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2020-04-18 09:27:35 Re: fixing old_snapshot_threshold's time->xid mapping
Previous Message Erik Rijkers 2020-04-18 09:10:50 Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions