Tracing down buildfarm "postmaster does not shut down" failures

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Tracing down buildfarm "postmaster does not shut down" failures
Date: 2016-02-08 19:15:48
Message-ID: 17763.1454958948@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Of late, by far the majority of the random-noise failures we see in the
buildfarm have come from failure to shut down the postmaster in a
reasonable timeframe. An example is this current failure on hornet:

http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55

waiting for server to shut down........................................................................................................................... failed
pg_ctl: server does not shut down
=========== db log file ==========
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG: received fast shutdown request
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG: aborting any active transactions
2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG: autovacuum launcher shutting down
2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG: shutting down

The buildfarm script runs pg_ctl stop with -t 120, and counting the dots
shows that pg_ctl was honoring that, so apparently the postmaster took
more than 2 minutes to shut down.

Looking at other recent successful runs, stopdb-C-1 usually takes 30 to
40 or so seconds on this machine. So it's possible that it was just so
overloaded that it took three times that much time on this run, but I'm
starting to think there may be more to it than that. We've seen variants
on this theme on half a dozen machines just in the past week --- and it
seems to mostly happen in 9.5 and HEAD, which is fishy.

The fact that we got "shutting down" but not "database system is shut
down" indicates that the server was in ShutdownXLOG(). Normally the
only component of that that takes much time is buffer flushing, but
could something else be happening? Or perhaps the checkpoint code
has somehow not gotten the word to do its flushing at full speed?

What I'd like to do to investigate this is put in a temporary HEAD-only
patch that makes ShutdownXLOG() and its subroutines much chattier about
how far they've gotten and what time it is, and also makes pg_ctl print
out the current time if it gives up waiting. A few failed runs with
that in place will at least allow us to confirm or deny whether it's
just that the shutdown checkpoint is sometimes really slow, or whether
there's a bug lurking.

Any objections? Anybody have another idea for data to collect?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2016-02-08 19:22:32 Re: pgbench stats per script & other stuff
Previous Message Alvaro Herrera 2016-02-08 19:13:04 Re: [PROPOSAL] VACUUM Progress Checker.