Re: Random pg_upgrade test failure on drongo

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "andrew(at)dunslane(dot)net" <andrew(at)dunslane(dot)net>
Subject: Re: Random pg_upgrade test failure on drongo
Date: 2024-01-03 11:42:21
Message-ID: CAA4eK1LwNU3rpV4qSaVgrTJ=pCos9OOsQdN3xDaTdVp05bZRmA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 2, 2024 at 10:30 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>
> 28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote:
> > Dear Alexander,
> >
> >> I agree with your analysis and would like to propose a PoC fix (see
> >> attached). With this patch applied, 20 iterations succeeded for me.
> > There are no reviewers so that I will review again. Let's move the PoC
> > to the concrete patch. Note that I only focused on fixes of random failure,
> > other parts are out-of-scope.
>
> Thinking about that fix more, I'm not satisfied with the approach proposed.
> First, it turns every unlink operation into two write operations
> (rename + unlink), not to say about new risks of having stale .tmp files
> (perhaps, it's ok for regular files (MoveFileEx can overwrite existing
> files), but not for directories)
> Second, it does that on any Windows OS versions, including modern ones,
> which are not affected by the issue, as we know.
>
> So I started to think about other approach: to perform unlink as it's
> implemented now, but then wait until the DELETE_PENDING state is gone.
>

There is a comment in the code which suggests we shouldn't wait
indefinitely. See "However, we won't wait indefinitely for someone
else to close the file, as the caller might be holding locks and
blocking other backends."

> And I was very surprised to see that this state is not transient in our case.
> Additional investigation showed that the test fails not because some aside
> process opens a file (concretely, {template1_id/postgres_id}/2683), that is
> being deleted, but because of an internal process that opens the file and
> holds a handle to it indefinitely.
> And the internal process is ... background writer (BgBufferSync()).
>
> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
> got 20 x 10 tests passing.
>
> Thus, it we want just to get rid of the test failure, maybe it's enough to
> add this to the test's config...
>

What about checkpoints? Can't it do the same while writing the buffers?

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey M. Borodin 2024-01-03 11:46:36 Re: Transaction timeout
Previous Message Melih Mutlu 2024-01-03 11:40:01 Re: Parent/child context relation in pg_get_backend_memory_contexts()