Re: tests against running server occasionally fail, postgres_fdw & tenk1

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: tests against running server occasionally fail, postgres_fdw & tenk1
Date: 2023-02-07 01:53:00
Message-ID: 20230207015300.7ql6omvdqjfks5tp@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-12-08 16:36:07 -0800, Andres Freund wrote:
> On 2022-12-08 16:15:11 -0800, Andres Freund wrote:
> > Unfortunately cfbot shows that that doesn't work entirely reliably.
> >
> > The most frequent case is postgres_fdw, which somewhat regularly fails with a
> > regression.diff like this:
> >
> > diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
> > --- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out 2022-12-08 20:35:24.772888000 +0000
> > +++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out 2022-12-08 20:43:38.199450000 +0000
> > @@ -9911,8 +9911,7 @@
> > WHERE application_name = 'fdw_retry_check';
> > pg_terminate_backend
> > ----------------------
> > - t
> > -(1 row)
> > +(0 rows)
> >
> > -- This query should detect the broken connection when starting new remote
> > -- transaction, reestablish new connection, and then succeed.

> >
> > I guess that a cache reset message arrives and leads to the connection being
> > terminated. Unfortunately that's hard to see right now, as the relevant log
> > messages are output with DEBUG3 - it's quite verbose, so enabling it for all
> > tests will be painful.
> >
> > I *think* I have seen this failure locally at least once, when running the
> > test normally.
> >
> >
> > I'll try to reproduce this locally for a bit. If I can't, the only other idea
> > I have for debugging this is to change log_min_messages in that section of the
> > postgres_fdw test to DEBUG3.
>
> Oh. I tried to reproduce the issue, without success so far, but eventually my
> test loop got stuck in something I reported previously and forgot about since:
> https://www.postgresql.org/message-id/20220925232237.p6uskba2dw6fnwj2%40awork3.anarazel.de
>
> I wonder if the timing on the freebsd CI task works out to hitting a "smaller
> version" of the problem that eventually resolves itself, which then leads to a
> sinval reset getting sent, causing the observable problem.

The issue referenced above is now fixed, and I haven't seen instances of it
since then. I also just now fixed the issue that often lead to failing to
upload logs.

Unfortunately the fdw_retry_check issue from above has re-occurred since then:

https://cirrus-ci.com/task/4901157940756480
https://api.cirrus-ci.com/v1/artifact/task/4901157940756480/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs

Another run hit an issue we've been fighting repeatedly on the buildfarm / CI:
https://cirrus-ci.com/task/5527490404286464
https://api.cirrus-ci.com/v1/artifact/task/5527490404286464/testrun/build/testrun/regress-running/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out 2023-02-06 23:52:43.627604000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out 2023-02-07 00:03:04.535232000 +0000
@@ -1930,12 +1930,13 @@
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
--------------------------------------------------------
- Index Only Scan using tenk1_thous_tenthous on tenk1
- Index Cond: (thousand < 2)
- Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Sort Key: thousand
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(4 rows)

SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)

I'd be tempted to disable the test, but it also found genuine issues in a
bunch of CF entries, and all these test instabilities seem like ones we'd also
see, with a lower occurrence on the buildfarm.

WRT the fdw_retry_check: I wonder if we should increase the log level of
a) pgfdw_inval_callback deciding to disconnect
b) ReceiveSharedInvalidMessages() deciding to reset

to DEBUG1, at least temporarily?

Alternatively we could add a
SET log_min_messages=debug4;
...
RESET log_min_messages;

around the postgres_fdw connection re-establishment test?

One thing nudging me towards the more global approach is that I have the vague
feelign that there's a wider issue around hitting more sinval resets than we
should - and it'd right now be very hard to know about that.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hayato Kuroda (Fujitsu) 2023-02-07 02:03:05 RE: Perform streaming logical transactions by background workers and parallel apply
Previous Message Peter Smith 2023-02-07 00:32:43 Re: Time delayed LR (WAS Re: logical replication restrictions)