From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: tests against running server occasionally fail, postgres_fdw & tenk1
Date: 2022-12-09 00:15:11
Message-ID: 20221209001511.n3vqodxobqgscuil@awork3.anarazel.de
Lists: pgsql-hackers
Hi,
Since
commit 3f0e786ccbf
Author: Andres Freund <andres(at)anarazel(dot)de>
Date: 2022-12-07 12:13:35 -0800
meson: Add 'running' test setup, as a replacement for installcheck
CI tests the pg_regress/isolationtester tests that support doing so against a
running server.
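For reference, this is the test setup the commit above added; locally it can be invoked roughly like this (assuming a meson build directory and a server already started from the temporary installation):

```shell
# run only the tests that support execution against an already-running server
meson test --setup running
```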
Unfortunately, cfbot shows that this doesn't work entirely reliably.
The most frequent case is postgres_fdw, which somewhat regularly fails with a
regression.diff like this:
diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
--- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out 2022-12-08 20:35:24.772888000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out 2022-12-08 20:43:38.199450000 +0000
@@ -9911,8 +9911,7 @@
WHERE application_name = 'fdw_retry_check';
pg_terminate_backend
----------------------
- t
-(1 row)
+(0 rows)
-- This query should detect the broken connection when starting new remote
-- transaction, reestablish new connection, and then succeed.
See e.g.
https://cirrus-ci.com/task/5925540020879360
https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/runningcheck.log
The following comment in the test provides a hint about what might be happening:
-- If debug_discard_caches is active, it results in
-- dropping remote connections after every transaction, making it
-- impossible to test termination meaningfully. So turn that off
-- for this test.
SET debug_discard_caches = 0;
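The failing section of contrib/postgres_fdw/sql/postgres_fdw.sql looks roughly like this (paraphrased; the exact statements in the test may differ slightly):

```sql
-- Cache discards would drop remote connections after every transaction,
-- making termination impossible to test meaningfully, so turn that off.
SET debug_discard_caches = 0;

-- Terminate the remote backend behind the cached connection; the expected
-- output assumes exactly one matching backend, i.e. that the cached
-- connection survived until this point.
SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
  WHERE application_name = 'fdw_retry_check';
```

If something else (e.g. a sinval reset) already caused the connection to be closed, the query matches zero rows, which is exactly the observed diff.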
I guess that a cache reset message arrives and leads to the connection being
terminated. Unfortunately that's hard to see right now, as the relevant log
messages are only emitted at DEBUG3; that level is quite verbose, so enabling
it for all tests would be painful.
I *think* I have seen this failure locally at least once, when running the
test normally.
I'll try to reproduce this locally for a bit. If I can't, the only other idea
I have for debugging this is to change log_min_messages in that section of the
postgres_fdw test to DEBUG3.
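That change might look like this (a sketch; log_min_messages is a superuser-only GUC, which the regression role can set):

```sql
-- temporarily raise verbosity so connection teardown shows up in the log
SET log_min_messages = DEBUG3;
-- ... the connection-retry section of the test ...
RESET log_min_messages;
```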
The second failure case I observed, at a lower frequency, is in the main
regression tests:
https://cirrus-ci.com/task/5640584912699392
https://api.cirrus-ci.com/v1/artifact/task/5640584912699392/testrun/build/testrun/regress-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5640584912699392/testrun/build/testrun/runningcheck.log
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out 2022-12-08 16:49:28.239508000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out 2022-12-08 16:57:17.286650000 +0000
@@ -1910,11 +1910,15 @@
SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)
ORDER BY unique1;
- QUERY PLAN
--------------------------------------------------------
- Index Only Scan using tenk1_unique1 on tenk1
- Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
-(2 rows)
+ QUERY PLAN
+-------------------------------------------------------------------
+ Sort
+ Sort Key: unique1
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+ -> Bitmap Index Scan on tenk1_unique1
+ Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+(6 rows)
SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)
I think we've seen this failure a number of times before, even in the
temp-install case. We fixed one source of this issue in this thread:
https://www.postgresql.org/message-id/1346227.1649887693%40sss.pgh.pa.us
but it looks like there's some remaining instability.
According to the server log (link above), there's no autovacuum on
tenk1.
Unfortunately we don't log non-automatic vacuums unless they are run with
verbose, so we can't see what horizon was used (see heap_vacuum_rel()'s
computation of 'instrument').
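Running the vacuum manually with VERBOSE would surface that information; e.g. (a sketch, against a PG 15+ server, where the per-table summary includes the removable cutoff):

```sql
VACUUM (VERBOSE) tenk1;
```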
I don't have a better idea than to either change the above, or to revert
91998539b227dfc6dd091714da7d106f2c95a321.
Greetings,
Andres Freund