|From:||Michael Paquier <michael(at)paquier(dot)xyz>|
|To:||Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>|
|Cc:||jgdr(at)dalibo(dot)com, masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org|
|Subject:||Re: [BUG] non archived WAL removed during production crash recovery|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
On Mon, Apr 20, 2020 at 04:02:31PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 18 Apr 2020 18:26:11 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
> As the result, +1 to what v7 is doing and discussing on earlier
> removal of such WAL segments separately if needed.
Thanks for the extra review.
>> Yeah. We could try to do with "false" as command anyway, and see what
>> the buildfarm thinks. As the test is skipped on Windows, I would
>> assume that it does not matter much anyway. Let's see what others
>> think about this piece. I don't have plans to touch again this patch
>> until likely the middle of next week.
> Couldn't we use "/" as a globally-results-in-failure command? But
> that doesn't increment failed_count. The reason is pgarch_archiveXLog
> exits with FATAL for "is a directory" error. The comment asserts that
> we exit with FATAL for SIGINT or SIGQUIT and if so it is enough to
> check only exit-by-signal case. The following fix worked.
Yeah, I was working on this stuff today and I noticed this problem. I
was just going to send an email on the matter with a more portable
patch and you also just beat me to it with this one :)
So yes, using "false" may be a bad idea because we cannot rely on the
case where the command does not exist in an environment in this test.
After more testing, I have been hit hard about the fact that the
archiver exits immediately if an archive command cannot be found
(errcode = 127), and it does not report this failure back to
pg_stat_archiver, which would cause the test to wait until the timeout
of poll_query_until() kills the test. There is however an extra
method not mentioned yet on this thread: we know that cp/copy is
portable enough per the buildfarm, so let's use a copy command that we
know *will* fail. A simple way of doing this is a command where the
origin file does not exist.
> --- a/src/backend/postmaster/pgarch.c
> +++ b/src/backend/postmaster/pgarch.c
> @@ -595,7 +595,7 @@ pgarch_archiveXlog(char *xlog)
> * "command not found" type of error. If we overreact it's no big
> * deal, the postmaster will just start the archiver again.
> - int lev = wait_result_is_any_signal(rc, true) ? FATAL : LOG;
> + int lev = wait_result_is_any_signal(rc, false) ? FATAL : LOG;
> if (WIFEXITED(rc))
> I didn't tested it on Windows (I somehow broke my repo and it's too
> slow to clone.) but system("/") returned 1 and I think that result
> increments the counter.
No, this would be a behavior change, which is not acceptable in my
view. (By the way, just nuke your full repo if it does not work
anymore on Windows, this method works).
>> Indeed. The extra initialization was part of v4, and got removed as
>> of v5. Still, it seems to me that this part was not complete without
>> updating the shared memory field correctly at the beginning of the
>> REDO processing as the last version of the patch does.
> I may not be following the discussion, but I think it is reasonable
> that SharedRecoveryState is initialized as CRASH then moves to ARCHIVE
> as needed and finished by NONE. That transition also stables
Thought as well about that over the weekend, and that's still the best
option to me.
> I think it would be better be RECOVERY_STATE_DONE.
I like this suggestion better than the original in v7.
> By the way I noticed that RecoveryState is exactly a subset of
> DBState. And changes of SharedRecoveryState happens side-by-side with
> ControlFileData->state in most places. Coundn't we just usee
> ControlFile->state instead of SharedRecoveryState?
I actually found confusing to use the same thing, because then the
reader would thing that SharedRecoveryState could be set to more
values but we don't want that.
> By the way I found a typo.
> +# Recovery tests for the achiving with a standby partially check
Attached is an updated patch, where I tweaked more comments.
Jehan-Guillaume, who is your colleague who found originally about this
problem? We should credit him in the commit message.
|Next Message||Sandeep Thakkar||2020-04-20 09:09:54||Re: BUG #16341: Installation with EnterpriseDB Community installer in NT AUTHORITY\SYSTEM context not possible|
|Previous Message||Kyotaro Horiguchi||2020-04-20 07:02:31||Re: [BUG] non archived WAL removed during production crash recovery|
|Next Message||Kyotaro Horiguchi||2020-04-20 07:46:56||Re: WAL usage calculation patch|
|Previous Message||Kyotaro Horiguchi||2020-04-20 07:24:06||Re: 001_rep_changes.pl stalls|