Re: Permission failures with WAL files in 13~ on Windows

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Permission failures with WAL files in 13~ on Windows
Date: 2021-03-18 03:01:40
Message-ID: YFLClG7KfETQ+xFG@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 17, 2021 at 07:30:04PM -0700, Andres Freund wrote:
> I suspect it might be easier to reproduce the issue with smaller WAL
> segments, a short checkpoint_timeout, and multiple jobs generating WAL
> and then sleeping for random amounts of time. Not sure if that's the
> sole ingredient, but consider what happens there's processes that
> XLogWrite()s some WAL and then sleeps. Typically such a process'
> openLogFile will still point to the WAL segment. And they may still do
> that when the next checkpoint finishes and we recycle the WAL file.

Yep. That's basically the kind of scenarios I have been testing to
stress the recycling/removing, with pgbench putting some load into the
server. This has worked for me. Once. But I have little idea why it
gets easier to reproduce in the environments of others, so there may
be an OS-version dependency in the equation here.

> I wonder if we actually fail to unlink() the file in
> durable_link_or_rename(), and then end up recycling the same old file
> into multiple "future" positions in the WAL stream.

You actually mean durable_rename_excl() as of 13~, right? Yeah, this
matches my impression that it is a two-step failure:
- Failure in one of the steps of durable_rename_excl().
- Fallback to segment removal, where we get the complain about
renaming.

> 1) and 2) seems problematic for restore_command use. I wonder if there's
> a chance that some of the reports ended up hitting 3), and that windows
> doesn't handle that well.

Yeap. I was thinking about 3) being the actual problem while going
through those docs two days ago.

> If you manage to reproduce, could you check what the link count of the
> all the segments is? Apparently sysinternal's findlinks can do that.
>
> Or perhaps even better, add an error check that the number of links of
> WAL segments is 1 in a bunch of places (recycling, opening them, closing
> them, maybe?).
>
> Plus error reporting for unlink failures, of course.

Yep, that's actually something I wrote for my own setups, with
log_checkpoints enabled to catch all concurrent checkpoint activity
and some LOGs. Still no luck unfortunately :(
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-03-18 03:02:50 Re: Getting better results from valgrind leak tracking
Previous Message Justin Pryzby 2021-03-18 03:00:03 Re: Parallel INSERT (INTO ... SELECT ...)