Re: Permission failures with WAL files in 13~ on Windows

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Magnus Hagander <magnus(at)hagander(dot)net>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Permission failures with WAL files in 13~ on Windows
Date:	2021-03-18 03:01:40
Message-ID:	YFLClG7KfETQ+xFG@paquier.xyz
Lists:	pgsql-hackers

On Wed, Mar 17, 2021 at 07:30:04PM -0700, Andres Freund wrote:
> I suspect it might be easier to reproduce the issue with smaller WAL
> segments, a short checkpoint_timeout, and multiple jobs generating WAL
> and then sleeping for random amounts of time. Not sure if that's the
> sole ingredient, but consider what happens when there are processes that
> XLogWrite() some WAL and then sleep. Typically such a process'
> openLogFile will still point to the WAL segment. And they may still do
> that when the next checkpoint finishes and we recycle the WAL file.

Yep. That's basically the kind of scenario I have been testing to
stress the recycling and removal code, with pgbench putting some load
on the server. This has worked for me. Once. But I have little idea why
it is easier to reproduce in other people's environments, so there may
be an OS-version dependency in the equation here.

> I wonder if we actually fail to unlink() the file in
> durable_link_or_rename(), and then end up recycling the same old file
> into multiple "future" positions in the WAL stream.

You actually mean durable_rename_excl() as of 13~, right? Yeah, this
matches my impression that it is a two-step failure:
- Failure in one of the steps of durable_rename_excl().
- Fallback to segment removal, where we get the complaint about
renaming.
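
To make the two steps concrete, here is a minimal sketch in Python, not
the server code. It assumes the exclusive rename is a hard link followed
by an unlink of the old name, and that the removal fallback first renames
the segment to a ".deleted" name before unlinking it. Both function names
and the ".deleted" suffix handling are illustrative assumptions.

```python
import os


def rename_excl_sketch(oldfile, newfile):
    """Hypothetical stand-in for a link-then-unlink exclusive rename.

    Returns True only if both steps succeed.
    """
    try:
        # Step 1: create the new name. Fails if newfile already exists.
        os.link(oldfile, newfile)
    except OSError as e:
        print(f"could not link {oldfile} to {newfile}: {e}")
        return False
    try:
        # Step 2: drop the old name. If this fails, the file is left
        # with two links, i.e. reachable under both names.
        os.unlink(oldfile)
    except OSError as e:
        print(f"could not remove {oldfile}: {e}")
        return False
    return True


def recycle_or_remove(segment, target):
    """Try to recycle a segment; fall back to removing it (assumed flow)."""
    if rename_excl_sketch(segment, target):
        return "recycled"
    # Fallback: rename out of the way first, then unlink. A failure in
    # this rename is where a "could not rename" message would show up.
    doomed = segment + ".deleted"
    try:
        os.rename(segment, doomed)
        os.unlink(doomed)
    except OSError as e:
        print(f"could not remove {segment}: {e}")
        return "remove failed"
    return "removed"
```

If step 2 of the rename fails silently, the old name survives and could be
picked up for recycling again, which is the multiple-"future"-positions
case described above.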

> 1) and 2) seems problematic for restore_command use. I wonder if there's
> a chance that some of the reports ended up hitting 3), and that windows
> doesn't handle that well.

Yep. I was thinking about 3) being the actual problem while going
through those docs two days ago.

> If you manage to reproduce, could you check what the link count of
> all the segments is? Apparently sysinternal's findlinks can do that.
>
> Or perhaps even better, add an error check that the number of links of
> WAL segments is 1 in a bunch of places (recycling, opening them, closing
> them, maybe?).
>
> Plus error reporting for unlink failures, of course.

Yep, that's actually something I wrote for my own setups, with
log_checkpoints enabled to catch all concurrent checkpoint activity
and some LOGs. Still no luck unfortunately :(
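
For checking link counts from outside the server, something along these
lines could be run against the WAL directory. It is only a sketch in
Python; the function name is made up, and it treats every regular file in
the directory as a segment.

```python
import os


def segments_with_extra_links(wal_dir):
    """Return {file name: link count} for files whose link count is not 1."""
    suspicious = {}
    for entry in os.scandir(wal_dir):
        if not entry.is_file(follow_symlinks=False):
            continue
        # Call os.stat() on the path rather than entry.stat(): on Windows
        # the cached DirEntry result does not fill in st_nlink.
        nlink = os.stat(entry.path).st_nlink
        if nlink != 1:
            suspicious[entry.name] = nlink
    return suspicious
```

An empty result means every file has exactly one link; anything else
lists the names that share a file with another directory entry.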
--
Michael

In response to

Re: Permission failures with WAL files in 13~ on Windows at 2021-03-18 02:30:04 from Andres Freund

Responses

Re: Permission failures with WAL files in 13~ on Windows at 2021-03-22 05:46:15 from Michael Paquier