Re: Permission failures with WAL files in 13~ on Windows

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Permission failures with WAL files in 13~ on Windows
Date: 2021-03-16 09:02:25
Message-ID: CABUevEzV88=8MFfOBobo1uqYZ_54saNZe_wZ_Vs_aHwPj+oB6g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 16, 2021 at 8:20 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> Hi all,
>
> There has been for the last couple of weeks a collection of reports
> complaining that the renaming of WAL segments is broken:
> https://www.postgresql.org/message-id/3861ff1e-0923-7838-e826-094cc9bef737@hot.ee
> https://www.postgresql.org/message-id/16874-c3eecd319e36a2bf@postgresql.org
> https://www.postgresql.org/message-id/095ccf8d-7f58-d928-427c-b17ace23cae6@burgess.co.nz
> https://www.postgresql.org/message-id/16927-67c570d968c99567%40postgresql.org
>
> These have happened when segments are recycled, on a variety of
> Windows versions, with 2019 and 2012 R2 being mentioned.
>
> The number of those failures is alarming, and the information gathered
> points at 13.1 and 13.2 as the culprits where those failures are
> happening, so I'd like to believe that there is a regression in 13.

Agreed.

> FWIW, I have also been doing some tests on my side, and while I was not
> able to trigger the reported failure, I have been able to trigger the
> same error with an archive_command doing a simple cp that failed
> continuously on EACCES.
>
> Fujii-san has mentioned that on twitter, but one area that has changed
> during the v13 cycle is aaa3aed, where the code recycling segments has
> been switched from a pgrename() (with a retry loop) to a
> CreateHardLinkA()+pgunlink() (with a retry loop for the second). One
> theory that I have in mind here is the case where we create the hard
> link, but fail to complete the pgunlink() on the xlogtemp.N file,
> though after some testing it did not seem to have any impact.

If you back out that patch, does the problem you can reproduce with
archive_command go away?
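
For context, the difference between the two recycling strategies described
above can be sketched roughly as follows. This is only a portable
illustration and not the PostgreSQL source: POSIX rename(), link() and
unlink() stand in for pgrename(), CreateHardLinkA() and pgunlink(), and the
helper names, retry count and delay are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical retry parameters, chosen only for illustration. */
#define SKETCH_MAX_RETRIES 100
#define SKETCH_RETRY_DELAY_NS 100000000L	/* 100 ms */

static void
sketch_sleep(void)
{
	struct timespec ts = {0, SKETCH_RETRY_DELAY_NS};

	nanosleep(&ts, NULL);
}

/*
 * Older approach (sketch): rename the file, retrying while the call fails
 * with EACCES, e.g. because another process briefly holds the file open.
 */
static int
rename_with_retry(const char *from, const char *to)
{
	assert(from != NULL && to != NULL);

	for (int i = 0; i < SKETCH_MAX_RETRIES; i++)
	{
		if (rename(from, to) == 0)
			return 0;
		if (errno != EACCES)
			return -1;
		sleep_again:
		sketch_sleep();
	}
	return -1;
}

/*
 * Newer approach (sketch): create a hard link under the target name, then
 * remove the source name. Only the removal is retried, so an EACCES from
 * the link step is reported immediately.
 */
static int
link_then_unlink(const char *from, const char *to)
{
	assert(from != NULL && to != NULL);

	if (link(from, to) != 0)
		return -1;				/* no retry on the link step */

	for (int i = 0; i < SKETCH_MAX_RETRIES; i++)
	{
		if (unlink(from) == 0)
			return 0;
		if (errno != EACCES)
			return -1;
		sketch_sleep();
	}
	return -1;					/* link created, old name still present */
}
```

The point of the sketch is only that, in the second variant, a transient
permission error on the first step is not retried, which is one place where
behaviour could differ from the rename-based code.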

> I am running more tests with several scenarios (aggressive segment
> recycling or segment rotation) to get more reproducible scenarios,
> but I was wondering if anybody had ideas around that.
>
> So, thoughts?

I agree with your analysis in general. It certainly seems to hit right
in the center of the problem scope.

Maybe hardlinks on Windows have yet another "weird behaviour" vs what
we're used to from Unix.

It would definitely be more useful if we could figure out *when* this
happens. But failing that, I wonder if we could find a way to provide
a build with this patch backed out for the bug reporters to test out,
given they all seem to have it fairly well reproducible. (But I am
assuming they are unlikely to be able to create their own builds easily,
given the complexity of doing so on Windows.) Given that this is a
pretty isolated change, it should hopefully be easy enough to back out
for testing.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
