tests against running server occasionally fail, postgres_fdw & tenk1

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: tests against running server occasionally fail, postgres_fdw & tenk1
Date: 2022-12-09 00:15:11
Message-ID: 20221209001511.n3vqodxobqgscuil@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

Since

commit 3f0e786ccbf
Author: Andres Freund <andres(at)anarazel(dot)de>
Date: 2022-12-07 12:13:35 -0800

meson: Add 'running' test setup, as a replacement for installcheck

CI runs the pg_regress/isolationtester tests that support it against a
running server.

Unfortunately, cfbot shows that this doesn't work entirely reliably.

The most frequent case is postgres_fdw, which somewhat regularly fails with a
regression.diffs like this:

diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
--- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out 2022-12-08 20:35:24.772888000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out 2022-12-08 20:43:38.199450000 +0000
@@ -9911,8 +9911,7 @@
WHERE application_name = 'fdw_retry_check';
pg_terminate_backend
----------------------
- t
-(1 row)
+(0 rows)

-- This query should detect the broken connection when starting new remote
-- transaction, reestablish new connection, and then succeed.

See e.g.
https://cirrus-ci.com/task/5925540020879360
https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/runningcheck.log

The following comment in the test provides a hint about what might be happening:

-- If debug_discard_caches is active, it results in
-- dropping remote connections after every transaction, making it
-- impossible to test termination meaningfully. So turn that off
-- for this test.
SET debug_discard_caches = 0;

I guess that a cache reset message arrives and leads to the remote connection
being dropped. Unfortunately that's hard to see right now, as the relevant log
messages are emitted at DEBUG3 - it's quite verbose, so enabling that for all
tests would be painful.
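
If that's what is going on, forcing cache resets should reproduce the symptom
on demand. Roughly something like the following (an untested sketch, needing an
assert-enabled build for debug_discard_caches, and relying on the test's ft1
foreign table and the 'fdw_retry_check' application_name it already configures
for the loopback server):

SET debug_discard_caches = 1;    -- mimic a stray cache reset arriving
SELECT count(*) FROM ft1;        -- opens a remote connection, which should get
                                 -- dropped again at end of transaction
SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
  WHERE application_name = 'fdw_retry_check';
-- expect (0 rows) here, matching the failure above
RESET debug_discard_caches;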

I *think* I have seen this failure locally at least once, when running the
test normally.

I'll try to reproduce this locally for a bit. If I can't, the only other idea
I have for debugging this is to change log_min_messages in that section of the
postgres_fdw test to DEBUG3.
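I.e., roughly bracketing just that part of the test, something like this
(a sketch only - the added SET/RESET would of course also have to show up in
the expected output):

SET log_min_messages = debug3;
-- ... the existing fdw_retry_check termination statements ...
RESET log_min_messages;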

The second failure case I observed, at a lower frequency, is in the main
regression tests:
https://cirrus-ci.com/task/5640584912699392
https://api.cirrus-ci.com/v1/artifact/task/5640584912699392/testrun/build/testrun/regress-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5640584912699392/testrun/build/testrun/runningcheck.log

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out 2022-12-08 16:49:28.239508000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out 2022-12-08 16:57:17.286650000 +0000
@@ -1910,11 +1910,15 @@
SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)
ORDER BY unique1;
-                       QUERY PLAN
---------------------------------------------------------
- Index Only Scan using tenk1_unique1 on tenk1
-   Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
-(2 rows)
+                             QUERY PLAN
+-------------------------------------------------------------------
+ Sort
+   Sort Key: unique1
+   ->  Bitmap Heap Scan on tenk1
+         Recheck Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+         ->  Bitmap Index Scan on tenk1_unique1
+               Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+(6 rows)

SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)

I think we've seen this one a number of times before, even in the temp-install
case. We fixed one source of this issue in this thread:
https://www.postgresql.org/message-id/1346227.1649887693%40sss.pgh.pa.us
but it looks like there's some remaining instability.
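
If it's again the visibility-map side of that problem - the horizon being held
back so tenk1's pages don't get marked all-visible, which makes the index-only
scan look expensive - relallvisible should be visibly low at that point. Purely
as a debugging aid, something along these lines run at failure time would show
it (not something to add to the test itself, obviously):

-- how much of tenk1 the planner believes to be all-visible;
-- an index-only scan is only attractive when this is close to relpages
SELECT relpages, relallvisible
FROM pg_class
WHERE oid = 'tenk1'::regclass;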

According to the server log (link above), there's no autovacuum on
tenk1.

Unfortunately we don't log non-automatic vacuums unless they are run with
VERBOSE, so we can't see what horizon was used (see heap_vacuum_rel()'s
computation of 'instrument').

I don't have a better idea than to either change the above, or to revert
91998539b227dfc6dd091714da7d106f2c95a321.
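
(For completeness: the horizon could also be surfaced without a code change by
making that vacuum verbose - assuming we're willing to touch wherever the test
currently vacuums tenk1, something like:

-- the cutoff reported in the VERBOSE output ("removable cutoff", IIRC)
-- shows the horizon that was used
VACUUM (VERBOSE, ANALYZE) tenk1;

But the VERBOSE output goes to the client and would end up, nondeterministically,
in the regression output, so that's not really an option either.)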

Greetings,

Andres Freund
