Re: Random pg_upgrade test failure on drongo

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: "'pgsql-hackers(at)lists(dot)postgresql(dot)org'" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Random pg_upgrade test failure on drongo
Date: 2023-11-30 16:00:01
Message-ID: 093b8ab5-e634-9ec6-4f59-b5c659ebb8f7@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Andrew and Kuroda-san,

27.11.2023 16:58, Andrew Dunstan wrote:
>>> It's also interesting, what is full version/build of OS on drongo and
>>> fairywren.
>>>
>>
>> It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122
>>
>
> I've updated it to 17763.5122 now.
>

Thank you for the information! It had pushed me to upgrade my Server
2019 1809 17763.592 to 17763.4252. And then I discovered that I have
difficulties with reproducing the issue on all my VMs after reboot (even
on old versions/builds). It took me a while to understand what's going on
and what affects reproduction of the issue.
I was puzzled by the fact that I can't reproduce the issue with my
unlink-open test program under seemingly the same conditions as before,
until I realized that the issue reproduced only when the target directory
opened in Windows Explorer.

Now I'm sorry for bringing more mystery into the topic and for misleading
information.

So, the issue reproduced only when something scans the working directory
for files/opens them. I added the same logic into my test program (see
unlink-open-scandir attached) and now I see the failure on Windows Server
2019 (Version 10.0.17763.4252).
A script like this:
start cmd /c "unlink-open-scandir test1 10 5000 >log1 2>&1"
...
start cmd /c "unlink-open-scandir test10 10 5000 >log10 2>&1"

results in:
C:\temp>find "failed" log*
---------- LOG1
---------- LOG10
fopen() after unlink() failed (13)
---------- LOG2
fopen() after unlink() failed (13)
---------- LOG3
fopen() after unlink() failed (13)
---------- LOG4
fopen() after unlink() failed (13)
---------- LOG5
fopen() after unlink() failed (13)
---------- LOG6
fopen() after unlink() failed (13)
---------- LOG7
fopen() after unlink() failed (13)
---------- LOG8
fopen() after unlink() failed (13)
---------- LOG9
fopen() after unlink() failed (13)

C:\temp>type log10
...
iteration 108
fopen() after unlink() failed (13)

The same observed on:
Windows 10 Version 1809 (OS Build 17763.1)

But no failures on:
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So the behavior change really took place, but my previous estimations were
incorrect (my apologies).

BTW, "rename" mode of the test program can produce more rare errors on
rename:
---------- LOG3
MoveFileEx() failed (0)

but not on open.

30.11.2023 13:00, Hayato Kuroda (Fujitsu) wrote:

> Thanks for your interest for the issue. I have been tracking the failure but been not occurred.
> Your analysis seems to solve BF failures, by updating OSes.

Yes, but I don't think that leaving Server 2019 behind (I suppose Windows
Server 2019 build 20348 would have the same behaviour as Windows 10 19045)
is affordable. (Though looking at Cirrus CI logs, I see that what is
entitled "Windows Server 2019" in fact is Windows Server 2022 there.)

>> I think that's because unlink() is performed asynchronously on those old
>> Windows versions, but rename() is always synchronous.
>
> OK. Actually I could not find descriptions about them, but your experiment showed facts.

I don't know how this peculiarity is called, but it looks like when some
other process captures the file handle, unlink() exits as if the file was
deleted completely, but the subsequent open() fails.

Best regards,
Alexander

Attachment Content-Type Size
unlink-open-scandir.c text/x-csrc 2.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Wirch, Eduard 2023-11-30 16:24:57 Re: PostgreSql: Canceled on conflict out to old pivot
Previous Message Pavel Stehule 2023-11-30 15:45:26 Re: proposal: possibility to read dumped table's name from file