Re: [BUG] non archived WAL removed during production crash recovery

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: [BUG] non archived WAL removed during production crash recovery
Date: 2020-04-17 13:33:04
Message-ID: 20200417153304.01a226cc@firost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Fri, 17 Apr 2020 15:50:43 +0900
Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> On Fri, Apr 17, 2020 at 12:07:39AM +0200, Jehan-Guillaume de Rorthais wrote:
> > On Thu, 16 Apr 2020 17:11:00 +0900 Michael Paquier <michael(at)paquier(dot)xyz>
> > wrote:
> >> Removing .done files does not matter as the segments related to them are
> >> already gone.
> >
> > Not necessarily. There's a time windows between the moment the archiver set
> > the .done file and when the checkpointer removes the associated WAL file.
> > So, after a PANIC because of lack of space, WAL associated with .done files
> > but not removed yet will be removed during the crash recovery.
>
> Not sure that it is something that matters for this thread though, so
> if necessary I think that it could be be discussed separately.

OK. However, unless I'm wrong, what I am describing as an desired behavior
is the current behavior of XLogArchiveCheckDone. So, we might want to decide if
v8 should return false during crash recovery no matter the archive_mode setup,
or if we keep the curent behavior. I vote for keeping it this way.

[...]
> > Unless I'm wrong, the empty string does not raise an error in
> > pg_stat_archiver, and I wanted to add a test on this as well.
>
> Exactly, it won't raise an error. Instead I switched to use a poll
> query with pg_stat_file() and .ready files, but this has proved to
> delay the test considerably if we did not create more segments. And
> your approach has the merit to be more simple with only two segments
> manipulated for the whole test. So I have tried first my idea,
> noticed the mess it introduced, and just kept your approach.

Maybe we could use something more common for all plateform? Eg.:

archive_command='this command does not exist'

At least, we would have the same error everywhere, as far as it could matter...

> > Thanks for your review! Let me know if you want me to add/change/fix some
> > tests.
>
> Thanks, I have worked more on the test, refactoring pieces related to
> the segment names, adjusting some comments and fixing some of the
> logic. Note that you introduced something incorrect at the creation
> of $standby2 as you have been updating postgresql.conf.auto for
> $standby1.

erf, last minute quick edit with lack of review on my side :(

> I have noticed an extra issue while looking at the backend pieces
> today: at the beginning of the REDO loop we forgot one place where
> SharedRecoveryState *has* to be updated to a correct state (around
> the comment "Update pg_control to show that we are..." in xlog.c) as
> the startup process may decide to switch the control file state to
> DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to
> update the new shared flag at this early stage. It did not matter
> before because SharedRecoveryInProgress would be only "true" for both,
> but that's not the case anymore as we need to make the difference
> between crash recovery and archive recovery in the new flag. There is
> no actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH
> as the initial shared memory state is RECOVERY_STATE_CRASH, but
> updating the flag makes the code more consistent IMCRASHO so I updated it
> anyway in the attached.

Grmbl...I had this logic the other way around: init with
RECOVERY_STATE_RECOVERY and set to CRASH in this exact if/then/else block... I
removed it in v4 when setting XLogCtl->SharedRecoveryState to RECOVERY or CRASH
based on ControlFile->state.

Sorry, I forgot it after discussing the init value in v5 :(

Regards,

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2020-04-17 13:34:09 Re: BUG #16373: Behavior of Temporary table creation
Previous Message Tom Lane 2020-04-17 13:29:36 Re: BUG #16374: I can't directly change owner from my created database to my created user.

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeremy Morton 2020-04-17 13:36:03 Re: Support for DATETIMEOFFSET
Previous Message Julien Rouhaud 2020-04-17 13:24:35 Re: Lexer issues