Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery

From:	Thomas Crayford <tcrayford(at)salesforce(dot)com>
To:	michael(at)paquier(dot)xyz
Cc:	pgsql-bugs(at)postgresql(dot)org
Subject:	Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery
Date:	2018-10-01 11:43:02
Message-ID:	CAJgZ2Z4-dPQd1V7PS04JESELCEWtykCBtvcJ6Ezpd+7xW2qqiA@mail.gmail.com
Lists:	pgsql-bugs

Hi there Michael,

Sorry for the slow response on this. I was on call last week, which was
quite busy and distracting.

With respect to the restore_command, we use wal-e
(https://github.com/wal-e/wal-e), specifically:

envdir DIRECTORY wal-e wal-fetch "%f" "%p"

Thanks

Tom
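
For readers unfamiliar with the placeholders in the restore_command above: PostgreSQL replaces %f with the name of the WAL file to fetch and %p with the destination path on the server, and %% stands for a literal percent sign. The following is a minimal sketch of that substitution, not PostgreSQL's implementation. The function name expand_restore_command, the segment name, and the destination path are assumptions chosen for illustration.

```python
def expand_restore_command(template: str, wal_file: str, dest_path: str) -> str:
    """Expand %f, %p and %% in a restore_command-style template.

    Hypothetical helper for illustration only; other escapes (such as %r)
    are left untouched.
    """
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template):
            nxt = template[i + 1]
            if nxt == "f":        # name of the WAL file to retrieve
                out.append(wal_file)
                i += 2
                continue
            if nxt == "p":        # path the file should be copied to
                out.append(dest_path)
                i += 2
                continue
            if nxt == "%":        # escaped literal percent sign
                out.append("%")
                i += 2
                continue
        out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    # Segment name and destination path are illustrative values.
    print(expand_restore_command(
        'envdir DIRECTORY wal-e wal-fetch "%f" "%p"',
        "000000010000000000000001",
        "pg_wal/RECOVERYXLOG",
    ))
```

The expansion is done in a single left-to-right pass so that text substituted for one placeholder is never rescanned for further placeholders.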

On Fri, Sep 28, 2018 at 11:59 PM Michael Paquier <michael(at)paquier(dot)xyz>
wrote:

> On Fri, Sep 28, 2018 at 01:02:42PM +0100, Thomas Crayford wrote:
> > Ok, thanks for the pointer. It seems like the race condition I talked about
> > is still accurate, does that seem right?
>
> KeepFileRestoredFromArchive() looks like a good candidate on the matter
> as it removes a WAL segment before replacing it by another with the same
> name. I have a hard time understanding why the checkpointer would try
> to recycle a segment just recovered though as the startup process would
> immediately try to use it. I have not spent more than one hour looking
> at potential spots though, which is not much for this kind of race
> condition.
>
> It is also why I am curious about what kind of restore_command you are
> using.
> --
> Michael
>
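
The pattern Michael describes, removing a file and then putting another one in place under the same name, leaves a short window in which the name does not exist at all. The sketch below illustrates that window in isolation. It is not PostgreSQL code: the function names, the file names, and the `between` hook (which stands in for a concurrent process trying to open the segment) are all assumptions made for the demonstration.

```python
import os
import tempfile


def replace_by_unlink_then_rename(src, dst, between=None):
    """Remove dst, then rename src onto it.

    `between` is a hypothetical hook that runs inside the gap, standing in
    for whatever another process might do at that moment.
    """
    if os.path.exists(dst):
        os.unlink(dst)
    if between is not None:
        between()          # dst does not exist here
    os.rename(src, dst)


def try_open(path):
    """Return the file contents, or None if the file is missing."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None


def demo():
    with tempfile.TemporaryDirectory() as d:
        segment = os.path.join(d, "segment")    # existing file
        restored = os.path.join(d, "restored")  # its replacement
        with open(segment, "wb") as f:
            f.write(b"old")
        with open(restored, "wb") as f:
            f.write(b"new")

        seen_in_gap = []
        replace_by_unlink_then_rename(
            restored, segment,
            between=lambda: seen_in_gap.append(try_open(segment)),
        )
        # First value: what a reader saw inside the gap.
        # Second value: what a reader sees after the rename.
        return seen_in_gap[0], try_open(segment)


if __name__ == "__main__":
    print(demo())
```

A reader that opens the file inside the gap gets "No such file or directory", while one that arrives after the rename sees the new contents. Whether this window is what actually triggers the reported error is exactly the open question in this thread.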

In response to

Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery at 2018-09-28 22:59:17 from Michael Paquier

Responses

Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery at 2018-10-02 01:06:49 from Michael Paquier