Re: pgsql: Add test for postmaster crash restarts.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Add test for postmaster crash restarts.
Date: 2017-09-19 16:13:54
Message-ID: 17084.1505837634@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

I discovered that prairiedog has been hung up for many hours in the
013_crash_restart.pl. It looks to me like the explanation is that
the test has a race condition, because what I find in the postmaster
log is

2017-09-19 00:31:48.194 EDT [27839] [unknown] LOG: connection received: host=[local]
2017-09-19 00:31:48.203 EDT [27839] [unknown] LOG: connection authorized: user=buildfarm database=postgres
2017-09-19 00:31:48.218 EDT [27839] t/013_crash_restart.pl LOG: statement: CREATE TABLE alive(status text);
2017-09-19 00:31:48.266 EDT [27839] t/013_crash_restart.pl LOG: statement: INSERT INTO alive VALUES($$committed-before-sigquit$$);
2017-09-19 00:31:48.271 EDT [27839] t/013_crash_restart.pl LOG: statement: SELECT pg_backend_pid();
2017-09-19 00:31:48.278 EDT [27839] t/013_crash_restart.pl LOG: statement: BEGIN;
2017-09-19 00:31:48.280 EDT [27839] t/013_crash_restart.pl LOG: statement: INSERT INTO alive VALUES($$in-progress-before-sigquit$$) RETURNING status;
2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl WARNING: terminating connection because of crash of another server process
2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted
shared memory.
2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-09-19 00:31:48.299 EDT [27827] LOG: server process (PID 27839) exited with exit code 2
2017-09-19 00:31:48.299 EDT [27827] DETAIL: Failed process was running: INSERT INTO alive VALUES($$in-progress-before-sigquit$$) RETURNING status;
2017-09-19 00:31:48.300 EDT [27827] LOG: terminating any other active server processes
2017-09-19 00:31:48.307 EDT [27832] WARNING: terminating connection because of crash of another server process
2017-09-19 00:31:48.307 EDT [27832] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-09-19 00:31:48.307 EDT [27832] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-09-19 00:31:48.317 EDT [27827] LOG: all server processes terminated; reinitializing
2017-09-19 00:31:48.333 EDT [27840] LOG: database system was interrupted; last known up at 2017-09-19 00:31:47 EDT
2017-09-19 00:31:48.338 EDT [27840] LOG: database system was not properly shut down; automatic recovery in progress
2017-09-19 00:31:48.346 EDT [27840] LOG: redo starts at 0/15A89EC
2017-09-19 00:31:48.361 EDT [27840] LOG: invalid record length at 0/15C6D74: wanted 24, got 0
2017-09-19 00:31:48.362 EDT [27840] LOG: redo done at 0/15C6D50
2017-09-19 00:31:48.362 EDT [27840] LOG: last completed transaction was at log time 2017-09-19 00:31:48.270076-04
2017-09-19 00:31:48.474 EDT [27827] LOG: database system is ready to accept connections
2017-09-19 00:31:48.492 EDT [27847] [unknown] LOG: connection received: host=[local]
2017-09-19 00:31:48.499 EDT [27847] [unknown] LOG: connection authorized: user=buildfarm database=postgres
2017-09-19 00:31:48.578 EDT [27847] t/013_crash_restart.pl LOG: statement: SELECT pg_sleep(3600);

IOW, the "$monitor" instance of psql did not complete making its
connection until after the crash/restart cycle had occurred.
So we're just sitting there waiting for a crash report that won't
come. Which is another very serious deficiency in this test:
lacking any sort of timeout, it will just freeze indefinitely
if anything doesn't happen exactly the way it expects. From a
buildfarm owner's standpoint, that's pretty damn unfriendly.
It means having to manually unwedge your animals from time to time.

I'd like to ask you to revert this test, at least pending making
it a whole lot more bulletproof. We don't really need crash
recovery testing in the buildfarm IMO --- we hackers crash the
system plenty often enough to notice problems there.

regards, tom lane

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Bruce Momjian 2017-09-19 16:23:29 pgsql: doc: add example of % substitution for connection URIs
Previous Message Andres Freund 2017-09-19 16:10:12 Re: Re: [COMMITTERS] pgsql: Perform only one ReadControlFile() during startup.

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2017-09-19 16:14:18 Re: Re: issue: record or row variable cannot be part of multiple-item INTO list
Previous Message Pavel Stehule 2017-09-19 16:13:43 Re: type cache for concat functions