bailing out in tap tests nearly always a bad idea

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: bailing out in tap tests nearly always a bad idea
Date: 2022-02-13 23:22:49
Message-ID: 20220213232249.7sevhlioapydla37@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

A recent cfbot failure reminded me of a topic I wanted to raise: The use of
BAIL in tap tests.

When a tap test exits with BAIL it doesn't just consider that test failed, but
it also stops running further tap tests in the same group:
https://metacpan.org/pod/Test::More
"Indicates to the harness that things are going so badly all testing should
terminate. This includes the running of any additional test scripts."

There may be a few cases where stopping testing makes sense, but nearly all of
the uses of bail in our tests look like they just want to fail the the current
test, rather than other tests in the same prove invocation.

Besides aborting other tests, it's also just hard to parse the output:
https://cirrus-ci.com/task/6243385840238592?logs=test_world#L2329
...
[16:28:12.178] [16:27:23] t/024_archive_recovery.pl ............ ok 3440 ms ( 0.00 usr 0.00 sys + 1.43 cusr 0.64 csys = 2.07 CPU)
[16:28:12.178] [16:27:26] t/025_stuck_on_old_timeline.pl ....... ok 3024 ms ( 0.00 usr 0.00 sys + 1.21 cusr 0.61 csys = 1.82 CPU)
[16:28:12.178] [16:27:30] t/026_overwrite_contrecord.pl ........ ok 3693 ms ( 0.01 usr 0.00 sys + 1.05 cusr 0.42 csys = 1.48 CPU)
[16:28:12.178] Bailout called. Further testing stopped: command "/tmp/cirrus-ci-build/src/test/recovery/../../../src/test/regress/pg_regress --dlpath="/tmp/cirrus-ci-build/src/test/regress" --bindir= --host=/tmp/gFHAMHHTJ_ --port=52708 --schedule=../regress/parallel_schedule --max-concurrent-tests=20 --inputdir=../regress --outputdir="/tmp/cirrus-ci-build/src/test/recovery/tmp_check"" exited with value 1
[16:28:12.178] FAILED--Further testing stopped: command "/tmp/cirrus-ci-build/src/test/recovery/../../../src/test/regress/pg_regress --dlpath="/tmp/cirrus-ci-build/src/test/regress" --bindir= --host=/tmp/gFHAMHHTJ_ --port=52708 --schedule=../regress/parallel_schedule --max-concurrent-tests=20 --inputdir=../regress --outputdir="/tmp/cirrus-ci-build/src/test/recovery/tmp_check"" exited with value 1
[16:28:12.178] make[2]: *** [Makefile:27: check] Error 255
[16:28:12.178] make[1]: *** [Makefile:48: check-recovery-recurse] Error 2

What failed was 027_stream_regress.pl, due to a trivial regression test output
change:

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/guc.out /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/guc.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/guc.out 2022-02-13 16:20:53.931949802 +0000
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/guc.out 2022-02-13 16:28:00.481307866 +0000
@@ -850,7 +850,8 @@
name
---------------------------
default_statistics_target
-(1 row)
+ enable_indexskipscan
+(2 rows)

-- Runtime-computed GUCs should be part of the preset category.
SELECT name FROM tab_settings_flags

But there's pretty much no way to know from the error output that this failure
was in 027_stream_regress.pl (just an example, this is widespread).

For some reason prove's output is much worse when the output is redirected to
a file. When testing with prove going to a terminal, bail at least prints the
test name.

interactive:
t/000_fail.pl ........................ Bailout called. Further testing stopped: I am bailing
FAILED--Further testing stopped: I am bailing

file:
Bailout called. Further testing stopped: I am bailing
FAILED--Further testing stopped: I am bailing

the latter is just about undebuggable, particularly when using
PROVE_FLAGS='-j...'.

Even just a 'die' is better than bail, output wise (and better due to not
preventing other tests from running):

interactive:
t/000_fail.pl ........................ Dubious, test returned 25 (wstat 6400, 0x1900)
No subtests run
t/001_stream_rep.pl .................. ok

file:
t/000_fail.pl ........................
Dubious, test returned 25 (wstat 6400, 0x1900)
No subtests run
t/001_stream_rep.pl .................. ok

Does anybody want to defend / explain the use of BAIL_OUT? If not, I think we
should consider doing a global replace of the use of bailing. Best with a
central function signalling fatal error, rather than individual uses of die
or such.

Greetings,

Andres Freund

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-02-13 23:30:06 Re: [PATCH] Avoid open and lock the table Extendend Statistics (src/backend/commands/statscmds.c)
Previous Message Tomas Vondra 2022-02-13 22:59:13 Re: [PATCH] Avoid open and lock the table Extendend Statistics (src/backend/commands/statscmds.c)