TAP test fail: we don't always detect backend crashes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: TAP test fail: we don't always detect backend crashes
Date: 2022-05-04 16:25:50
Message-ID: 2366244.1651681550@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I discovered while poking at an LDAP problem that our TAP tests are
100% reproducibly capable of ignoring server crashes and reporting
success anyway. This problem occurs if the postmaster doesn't get
the child failure report until after shutdown has been initiated,
in which case you find something like this in the postmaster log:

2022-05-04 12:01:33.946 EDT [57945] [unknown] LOG: connection received: host=[local]
2022-05-04 12:01:33.995 EDT [57945] [unknown] LOG: connection authenticated: identity="uid=test1,dc=example,dc=net" method=ldap (/Users/tgl/pgsql/src/test/ldap/tmp_check/t_001_auth_node_data/pgdata/pg_hba.conf:1)
2022-05-04 12:01:33.995 EDT [57945] [unknown] LOG: connection authorized: user=test1 database=postgres application_name=001_auth.pl
2022-05-04 12:01:33.998 EDT [57945] 001_auth.pl LOG: statement: SELECT $$connected with user=test1$$
2022-05-04 12:01:34.003 EDT [57937] LOG: received fast shutdown request
2022-05-04 12:01:34.003 EDT [57937] LOG: aborting any active transactions
2022-05-04 12:01:34.003 EDT [57937] LOG: background worker "logical replication launcher" (PID 57943) exited with exit code 1
2022-05-04 12:01:35.750 EDT [57937] LOG: server process (PID 57945) was terminated by signal 11: Segmentation fault: 11
2022-05-04 12:01:35.751 EDT [57937] LOG: terminating any other active server processes
2022-05-04 12:01:35.751 EDT [57937] LOG: abnormal database system shutdown
2022-05-04 12:01:35.751 EDT [57937] LOG: database system is shut down

Our TAP scripts don't notice the "abnormal database system shutdown",
so it looks like things have passed. It's even worse when a script
demands an immediate shutdown, because then the postmaster won't wait
around for the child status.

If you have core dumps enabled, that adds some time before the child
exit status is delivered, making this scenario extremely probable.
I'm finding that src/test/ldap reports success 100% reproducibly
after doing "ulimit -c unlimited", even though four backend core dumps
are produced during the run.

I think that (a) at least by default, node->stop() ought to check for
normal shutdown status, and (b) immediate shutdown should only be used
in cases where we've already decided that the test failed.

regards, tom lane

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2022-05-04 18:38:25 Re: strange slow query - lost lot of time somewhere
Previous Message Alvaro Herrera 2022-05-04 16:03:07 Re: automatically generating node support functions