Re: Random pg_upgrade test failure on drongo

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "andrew(at)dunslane(dot)net" <andrew(at)dunslane(dot)net>
Subject: Re: Random pg_upgrade test failure on drongo
Date: 2024-01-10 09:31:30
Message-ID: CAA4eK1Lq75HXRxucGrKzWNk8540kdk9dj0B4-6DMcHAZ+CE5+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 9, 2024 at 4:30 PM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>
> 09.01.2024 13:08, Amit Kapila wrote:
> >
> >> As to checkpoint_timeout, personally I would not increase it, because it
> >> seems unbelievable to me that pg_restore (with the cluster containing only
> >> two empty databases) can run for longer than 5 minutes. I'd rather
> >> investigate such situation separately, in case we encounter it, but maybe
> >> it's only me.
> >>
> > I feel it is okay to set a higher value of checkpoint_timeout due to
> > the same reason though the probability is less. I feel here it is
> > important to explain in the comments why we are using these settings
> > in the new test. I have thought of something like: "During the
> > upgrade, bgwriter or checkpointer could hold the file handle for some
> > removed file. Now, during restore when we try to create the file with
> > the same name, it errors out. This behavior is specific to only some
> > specific Windows versions and the probability of seeing this behavior
> > is higher in this test because we use wal_level as logical via
> > allows_streaming => 'logical' which in turn sets shared_buffers as
> > 1MB."
> >
> > Thoughts?
>
> I would describe that behavior as "During upgrade, when pg_restore performs
> CREATE DATABASE, bgwriter or checkpointer may flush buffers and hold a file
> handle for pg_largeobject, so later TRUNCATE pg_largeobject command will
> fail if OS (such as older Windows versions) doesn't remove an unlinked file
> completely till it's open. ..."
>

I am slightly hesitant to add any particular system table name in the
comments as this can happen for any other system table as well, so
slightly adjusted the comments in the attached. However, I think it is
okay to mention the particular system table name in the commit
message. Let me know what do you think.

--
With Regards,
Amit Kapila.

Attachment Content-Type Size
v2-0001-Fix-an-intermetant-BF-failure-in-003_logical_slot.patch application/octet-stream 2.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-01-10 09:34:15 Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Previous Message Shlok Kyal 2024-01-10 09:29:22 Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication