Re: Tracing down buildfarm "postmaster does not shut down" failures

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Tracing down buildfarm "postmaster does not shut down" failures
Date: 2016-02-08 20:59:32
Message-ID: 56B901B4.2020207@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 02/08/2016 02:15 PM, Tom Lane wrote:
> Of late, by far the majority of the random-noise failures we see in the
> buildfarm have come from failure to shut down the postmaster in a
> reasonable timeframe. An example is this current failure on hornet:
>
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55
>
> waiting for server to shut down........................................................................................................................... failed
> pg_ctl: server does not shut down
> =========== db log file ==========
> 2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG: received fast shutdown request
> 2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG: aborting any active transactions
> 2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG: autovacuum launcher shutting down
> 2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG: shutting down
>
> The buildfarm script runs pg_ctl stop with -t 120, and counting the dots
> shows that pg_ctl was honoring that, so apparently the postmaster took
> more than 2 minutes to shut down.
>
> Looking at other recent successful runs, stopdb-C-1 usually takes 30 to
> 40 or so seconds on this machine. So it's possible that it was just so
> overloaded that it took three times that much time on this run, but I'm
> starting to think there may be more to it than that. We've seen variants
> on this theme on half a dozen machines just in the past week --- and it
> seems to mostly happen in 9.5 and HEAD, which is fishy.
>
> The fact that we got "shutting down" but not "database system is shut
> down" indicates that the server was in ShutdownXLOG(). Normally the
> only component of that that takes much time is buffer flushing, but
> could something else be happening? Or perhaps the checkpoint code
> has somehow not gotten the word to do its flushing at full speed?
>
> What I'd like to do to investigate this is put in a temporary HEAD-only
> patch that makes ShutdownXLOG() and its subroutines much chattier about
> how far they've gotten and what time it is, and also makes pg_ctl print
> out the current time if it gives up waiting. A few failed runs with
> that in place will at least allow us to confirm or deny whether it's
> just that the shutdown checkpoint is sometimes really slow, or whether
> there's a bug lurking.
>
> Any objections? Anybody have another idea for data to collect?
>
>

I think that's an excellent idea. This has been a minor running sore for
ages on slow machines.

cheers

andrew

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2016-02-08 21:02:01 Re: [patch] Proposal for \crosstabview in psql
Previous Message Vik Fearing 2016-02-08 20:50:19 Re: proposal: make NOTIFY list de-duplication optional