Re: Random pg_upgrade test failure on drongo

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "'pgsql-hackers(at)lists(dot)postgresql(dot)org'" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: "'andrew(at)dunslane(dot)net'" <andrew(at)dunslane(dot)net>
Subject: Re: Random pg_upgrade test failure on drongo
Date: 2024-01-02 05:00:00
Message-ID: 05da3f4c-6c55-6cd8-6011-0003c95ac310@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Kuroda-san,

28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote:
> Dear Alexander,
>
>> I agree with your analysis and would like to propose a PoC fix (see
>> attached). With this patch applied, 20 iterations succeeded for me.
> There are no reviewers so that I will review again. Let's move the PoC
> to the concrete patch. Note that I only focused on fixes of random failure,
> other parts are out-of-scope.

Thinking about that fix more, I'm not satisfied with the approach proposed.
First, it turns every unlink operation into two write operations
(rename + unlink), not to say about new risks of having stale .tmp files
(perhaps, it's ok for regular files (MoveFileEx can overwrite existing
files), but not for directories)
Second, it does that on any Windows OS versions, including modern ones,
which are not affected by the issue, as we know.

So I started to think about other approach: to perform unlink as it's
implemented now, but then wait until the DELETE_PENDING state is gone.
And I was very surprised to see that this state is not transient in our case.
Additional investigation showed that the test fails not because some aside
process opens a file (concretely, {template1_id/postgres_id}/2683), that is
being deleted, but because of an internal process that opens the file and
holds a handle to it indefinitely.
And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, it we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...

The other way to go is to find out whether the background writer process
should react on a shared-inval message, sent from smgrdounlinkall(), and
close that file's handle,

Maybe we could also (after changing the bgwriter's behaviour) add a waiting
loop into pgwin32_open_handle() to completely rule out transient open()
failures due to some other process (such as Windows Exporer) opening a file
being deleted, but I would not complicate the things until we have a clear
vision/plans of using modern APIs/relying of modern OS versions' behavior.
I mean proceeding with something like:
https://commitfest.postgresql.org/40/3951/

Best regards,
Alexander

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-01-02 05:05:59 Re: Track in pg_replication_slots the reason why slots conflict?
Previous Message vignesh C 2024-01-02 02:45:40 Re: Commitfest manager January 2024