RE: Random pg_upgrade test failure on drongo

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Alexander Lakhin' <exclusion(at)gmail(dot)com>, "'andrew(at)dunslane(dot)net'" <andrew(at)dunslane(dot)net>
Cc: "'pgsql-hackers(at)lists(dot)postgresql(dot)org'" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Random pg_upgrade test failure on drongo
Date: 2023-11-30 10:00:21
Message-ID: TY3PR01MB9889CD6B11182AEBDA95B798F582A@TY3PR01MB9889.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Dear Alexander, Andrew,

Thanks for your analysis!

> I see that behavior on:
> Windows 10 Version 1607 (OS Build 14393.0)
> Windows Server 2016 Version 1607 (OS Build 14393.0)
> Windows Server 2019 Version 1809 (OS Build 17763.1)
>
> But it's not reproduced on:
> Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
> Windows Server 2019 Version 1809 (OS Build 17763.592)
> Windows 10 Version 22H2 (OS Build 19045.3693)
> Windows 11 Version 21H2 (OS Build 22000.613)
>
> So it looks like the failure occurs depending not on Windows edition, but
> rather on its build. For Windows Server 2019 the "good" build is
> somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
> 14393.0 and 17763.1.
> (Maybe there was some change related to
> FILE_DISPOSITION_POSIX_SEMANTICS/
> FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
> information about that change.)
>
> It's also interesting, what is full version/build of OS on drongo and
> fairywren.

Thanks for your interest in the issue. I have been tracking the failure, but it has not occurred again.
Your analysis suggests that the BF failures could be resolved by updating the OSes.

> I think that's because unlink() is performed asynchronously on those old
> Windows versions, but rename() is always synchronous.

OK. Actually, I could not find any documentation describing this behavior, but your experiments demonstrate it.
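
To make the point concrete for other readers, here is a minimal sketch (not taken
from your patch or from the PostgreSQL tree) of what "waiting for an asynchronous
unlink()" could look like: remove the name, then poll until it has really
disappeared. The helper names wait_until_gone() and unlink_and_wait() are
hypothetical, the code uses POSIX-style calls only, and how a delete-pending file
appears to stat() on the affected Windows builds is an assumption I have not
verified. The loop is conservative: anything other than ENOENT counts as
"still there".

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll until "path" no longer exists.
 * Returns true once stat() reports ENOENT, false if the name is still
 * visible (or stat() fails for another reason) after max_attempts checks.
 */
bool
wait_until_gone(const char *path, int max_attempts, long sleep_usec)
{
    struct stat st;

    assert(path != NULL);

    for (int i = 0; i < max_attempts; i++)
    {
        struct timespec ts;

        if (stat(path, &st) != 0 && errno == ENOENT)
            return true;        /* the name has really disappeared */

        /* Sleep before the next check. */
        ts.tv_sec = sleep_usec / 1000000L;
        ts.tv_nsec = (sleep_usec % 1000000L) * 1000L;
        nanosleep(&ts, NULL);
    }
    return false;
}

/*
 * Hypothetical helper: unlink "path" and wait for the removal to become
 * visible. A file that is already missing is treated as success.
 */
bool
unlink_and_wait(const char *path, int max_attempts, long sleep_usec)
{
    if (unlink(path) != 0 && errno != ENOENT)
        return false;           /* unlink itself failed */

    return wait_until_gone(path, max_attempts, sleep_usec);
}
```

On a system where unlink() is synchronous, the first stat() call already reports
ENOENT and no sleep happens, so such a wait would only cost time on the affected
builds. Whether a bounded wait like this is acceptable compared with the
rename-based approach is a separate question.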

> I've managed to reproduce that issue (or at least a situation that
> manifested similarly) with a sleep added in miscinit.c:
>     ereport(IsPostmasterEnvironment ? LOG : NOTICE,
>             (errmsg("database system is shut down")));
> +   pg_usleep(500000L);
>
> With this change, I get the same warning as in [1] when running in
> parallel 10 tests 002_pg_upgrade with a minimal olddump (on iterations
> 33, 46, 8). And with my PoC patch applied, I could see the same warning
> as well (on iteration 6).
>
> I believe that's because rename() can't rename a directory containing an
> open file, just as unlink() can't remove it.
>
> In the light of the above, I think that the issue in question should be
> fixed in accordance with/as a supplement to [2].

OK, I understand that more needs to be fixed in this area. For now, we should stay focused on the issue at hand.

Your patch looks good, but it needs more review from developers who are familiar with Windows.
What do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
