Re: tests against running server occasionally fail, postgres_fdw & tenk1

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <fujii(at)postgresql(dot)org>
Subject: Re: tests against running server occasionally fail, postgres_fdw & tenk1
Date: 2023-02-26 19:43:40
Message-ID: 20230226194340.u44bkfgyz64c67i6@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-12-08 16:15:11 -0800, Andres Freund wrote:
> The most frequent case is postgres_fdw, which somewhat regularly fails with a
> regression.diff like this:
>
> diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
> --- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out 2022-12-08 20:35:24.772888000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out 2022-12-08 20:43:38.199450000 +0000
> @@ -9911,8 +9911,7 @@
> WHERE application_name = 'fdw_retry_check';
> pg_terminate_backend
> ----------------------
> - t
> -(1 row)
> +(0 rows)
>
> -- This query should detect the broken connection when starting new remote
> -- transaction, reestablish new connection, and then succeed.
>
>
> See e.g.
> https://cirrus-ci.com/task/5925540020879360
> https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs
> https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/runningcheck.log
>
>
> The following comment in the test provides a hint what might be happening:
>
> -- If debug_discard_caches is active, it results in
> -- dropping remote connections after every transaction, making it
> -- impossible to test termination meaningfully. So turn that off
> -- for this test.
> SET debug_discard_caches = 0;
>
>
> I guess that a cache reset message arrives and leads to the connection being
> terminated. Unfortunately that's hard to see right now, as the relevant log
> messages are output with DEBUG3 - it's quite verbose, so enabling it for all
> tests will be painful.

Downthread I reported that I was able to pinpoint the source of the issue: it
is indeed a cache inval message arriving at the wrong moment.
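
To make the race concrete, here is a toy model in Python. It is only a sketch:
RemoteServer, FdwSession and run_test are hypothetical names, not PostgreSQL or
postgres_fdw code, and the model assumes that an invalidation causes the cached
remote connection to be closed at the end of the transaction. If that happens
before the test's pg_terminate_backend() query runs, the query finds no backend
named 'fdw_retry_check' and returns zero rows, as in the regression.diffs above.

```python
# Toy model only: none of these classes exist in PostgreSQL or postgres_fdw.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RemoteServer:
    """Stands in for the remote server and its pg_stat_activity."""
    backends: Dict[int, str] = field(default_factory=dict)  # pid -> application_name
    next_pid: int = 1000

    def connect(self, application_name: str) -> int:
        pid = self.next_pid
        self.next_pid += 1
        self.backends[pid] = application_name
        return pid

    def disconnect(self, pid: int) -> None:
        self.backends.pop(pid, None)

    def terminate_by_name(self, application_name: str) -> List[bool]:
        """Mimics the test's pg_terminate_backend() query: one True per
        backend whose application_name matches (i.e. the result rows)."""
        pids = [p for p, name in self.backends.items() if name == application_name]
        for p in pids:
            del self.backends[p]
        return [True for _ in pids]


@dataclass
class FdwSession:
    """Stands in for the local backend's cached remote connection."""
    server: RemoteServer
    cached_pid: Optional[int] = None

    def run_remote_query(self) -> None:
        # Open the remote connection on first use and keep it cached.
        if self.cached_pid is None:
            self.cached_pid = self.server.connect("fdw_retry_check")

    def end_of_transaction(self, inval_received: bool) -> None:
        # Assumption of this model: an invalidation that arrived during the
        # transaction causes the cached connection to be closed here.
        if inval_received and self.cached_pid is not None:
            self.server.disconnect(self.cached_pid)
            self.cached_pid = None


def run_test(inval_received: bool) -> List[bool]:
    server = RemoteServer()
    session = FdwSession(server)
    session.run_remote_query()
    session.end_of_transaction(inval_received)
    return server.terminate_by_name("fdw_retry_check")


if __name__ == "__main__":
    print(run_test(inval_received=False))  # [True]  -> "t / (1 row)", expected output
    print(run_test(inval_received=True))   # []      -> "(0 rows)", the flapping case
```

The model ignores timing entirely; the point is only that the test's expected
output depends on the remote connection still existing when the termination
query runs.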

We've had trouble with this test for years now. We added workarounds, like

commit 1273a15bf91fa322915e32d3b6dc6ec916397268
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: 2021-05-04 13:36:26 -0400

Disable cache clobber to avoid breaking postgres_fdw termination test.

But that didn't suffice to make it reliable. That's not entirely surprising,
given that there are sources of cache resets other than clobber cache.

Unless somebody comes up with a way to make the test more reliable pretty
soon, I think we should just remove it. It's one of the most frequently
flapping tests at the moment.

Greetings,

Andres Freund
