Re: Random pg_upgrade test failure on drongo

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "andrew(at)dunslane(dot)net" <andrew(at)dunslane(dot)net>
Subject: Re: Random pg_upgrade test failure on drongo
Date: 2024-01-09 10:08:53
Message-ID: CAA4eK1K27-CE8OVJXOYLGdF9GVzJ7WdW5n_OFr1O10hHO_1mYQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 9, 2024 at 2:30 PM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>
> 09.01.2024 08:49, Hayato Kuroda (Fujitsu) wrote:
> > Based on the suggestion by Amit, I have created a patch with the alternative
> > approach. This just does GUC settings. The reported failure is only for
> > 003_logical_slots, but the patch also includes changes for the recently added
> > test, 004_subscription. IIUC, there is a possibility that 004 would fail as well.
> >
> > Per our understanding, this patch can stop random failures. Alexander, can you
> > test for the confirmation?
> >
>
> Yes, the patch fixes the issue for me (without the patch I observe failures
> on iterations 1-2, with 10 tests running in parallel, but with the patch
> 10 iterations succeeded).
>
> But as far as I can see, 004_subscription is not affected by the issue,
> because it doesn't enable streaming for nodes new_sub, new_sub1.
> As I noted before, I could see the failure only with
> shared_buffers = 1MB (which is set with allows_streaming => 'logical').
> So I'm not sure whether we need to modify 004 (or any other test that
> runs pg_upgrade).
>

I see your point. The probable reason for the failure with
shared_buffers=1MB is that the probability of bgwriter holding the
file handle for pg_largeobject increases. So, let's change the
settings only for 003.
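
Something like the following is what I have in mind (an untested
sketch; I'm assuming the new cluster node in 003_logical_slots.pl is
$newpub and that disabling the bgwriter's LRU writing is enough to
keep it from holding handles for relation files such as
pg_largeobject):

    # Sketch: confine the workaround to 003_logical_slots.pl by adjusting
    # the new cluster's configuration before pg_upgrade runs.
    # bgwriter_lru_maxpages = 0 disables background writing, so the
    # bgwriter should not open (and keep holding) data files.
    $newpub->append_conf('postgresql.conf', 'bgwriter_lru_maxpages = 0');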

> As to checkpoint_timeout, personally I would not increase it, because it
> seems unbelievable to me that pg_restore (with the cluster containing only
> two empty databases) can run for longer than 5 minutes. I'd rather
> investigate such a situation separately, in case we encounter it, but maybe
> it's only me.
>

I feel it is okay to set a higher value of checkpoint_timeout for the
same reason, though the probability is lower. What is important here
is to explain in the comments why we are using these settings in the
new test. I have thought of something like: "During the upgrade,
bgwriter or checkpointer could hold the file handle for some removed
file. Now, during restore, when we try to create a file with the same
name, it errors out. This behavior is specific to some Windows
versions, and the probability of seeing it is higher in this test
because we use wal_level as logical via allows_streaming =>
'logical', which in turn sets shared_buffers to 1MB."
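
Extending the sketch above, the comment plus both settings could look
roughly like this in 003 (again only a sketch; the exact
checkpoint_timeout value is an assumption):

    # During the upgrade, bgwriter or checkpointer could hold the file
    # handle for some removed file.  Now, during restore, when we try to
    # create a file with the same name, it errors out.  This behavior is
    # specific to some Windows versions, and the probability of seeing it
    # is higher in this test because we use wal_level as logical via
    # allows_streaming => 'logical', which in turn sets shared_buffers
    # to 1MB.
    $newpub->append_conf('postgresql.conf', 'bgwriter_lru_maxpages = 0');
    $newpub->append_conf('postgresql.conf', 'checkpoint_timeout = 1h');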

Thoughts?

--
With Regards,
Amit Kapila.
