Re: Random pg_upgrade 004_subscription test failure on drongo

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Random pg_upgrade 004_subscription test failure on drongo
Date: 2025-03-13 12:40:50
Message-ID: cd3189f8-aaed-4ef6-a6b6-da72c1251f34@iki.fi
Lists: pgsql-hackers

On 13/03/2025 11:04, vignesh C wrote:
> ## Analysis
> I think it was caused by the STATUS_DELETE_PENDING failure and is not
> related to recent updates to pg_upgrade.
>
> The file "base/1/2683" is an index file for
> pg_largeobject_loid_pn_index, and the output means that creating the
> file failed. Below is a backtrace.
>
> ```
> pgwin32_open() // <-- this returns -1
> open()
> BasicOpenFilePerm()
> PathNameOpenFilePerm()
> PathNameOpenFile()
> mdcreate()
> smgrcreate()
> RelationCreateStorage()
> RelationSetNewRelfilenumber()
> ExecuteTruncateGuts()
> ExecuteTruncate()
> ```
>
> But this is strange. Before calling mdcreate(), we surely unlink any
> file that has the same name. Below is the trace down to the unlink.
>
> ```
> pgunlink()
> unlink()
> mdunlinkfork()
> mdunlink()
> smgrdounlinkall()
> RelationSetNewRelfilenumber() // common path with above
> ExecuteTruncateGuts()
> ExecuteTruncate()
> ```
>
> I found that Thomas said [4] that pgunlink() can sometimes fail to
> remove a file even though it returns OK; in that case the NTSTATUS is
> STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle()
> mentions the same thing:
>
> ```
> /*
> * ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
> * gone (Windows NT status code is STATUS_DELETE_PENDING). In that
> * case, we'd better ask for the NT status too so we can translate it
> * to a more Unix-like error. We hope that nothing clobbers the NT
> * status in between the internal NtCreateFile() call and CreateFile()
> * returning.
> *
> ```
>
> The definition of STATUS_DELETE_PENDING can be seen in [5]. Based on
> that, open() can indeed fail with STATUS_DELETE_PENDING if it tries to
> open a file whose deletion is still pending.
> ---------------------------------------------
>
> This was fixed by the following change in the target upgrade nodes:
> bgwriter_lru_maxpages = 0
> checkpoint_timeout = 1h
>
> Attached is a patch along similar lines for 004_subscription.

Hmm, this problem isn't limited to this one pg_upgrade test, right? It
could happen with any pg_upgrade invocation. And perhaps in a running
server too, if a relfilenumber is reused quickly. In dropdb() and
DropTableSpace() we do this:

    WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

Should we do the same here? Not sure where exactly to put that; perhaps
in mdcreate(), if the creation fails with STATUS_DELETE_PENDING.
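
To make the idea concrete, here is a minimal, self-contained C sketch of
"retry the creation once after a release barrier". It is only a
simulation, not PostgreSQL code: sim_create_file(), sim_release_barrier()
and create_with_barrier_retry() are hypothetical stand-ins, and EACCES is
merely an assumed placeholder for however the STATUS_DELETE_PENDING
failure would be reported to mdcreate().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simulated state: another process still holds the unlinked file open. */
static bool sim_handle_still_open = true;
static int	sim_barrier_calls = 0;

/*
 * Hypothetical stand-in for the file creation done in mdcreate().
 * While the old file is delete-pending, creation fails; EACCES is an
 * assumed placeholder for the real error reporting.
 */
static int
sim_create_file(const char *path)
{
	(void) path;
	if (sim_handle_still_open)
	{
		errno = EACCES;
		return -1;
	}
	return 3;					/* fake file descriptor */
}

/*
 * Hypothetical stand-in for emitting and waiting for a
 * PROCSIGNAL_BARRIER_SMGRRELEASE barrier: once it returns, all other
 * processes are assumed to have closed their handles.
 */
static void
sim_release_barrier(void)
{
	sim_barrier_calls++;
	sim_handle_still_open = false;
}

/* Try to create the file; on a delete-pending failure, barrier and retry once. */
static int
create_with_barrier_retry(const char *path)
{
	int			fd;

	assert(path != NULL);

	fd = sim_create_file(path);
	if (fd < 0 && errno == EACCES)
	{
		sim_release_barrier();
		fd = sim_create_file(path);
	}
	return fd;
}
```

With this shape, the common case pays nothing extra: the barrier is only
emitted on the failure path, and a second failure is still returned to
the caller to be reported as usual.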

--
Heikki Linnakangas
Neon (https://neon.tech)
