Re: Tracing down buildfarm "postmaster does not shut down" failures

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: Tracing down buildfarm "postmaster does not shut down" failures
Date: 2016-02-09 19:10:50
Message-ID: 17908.1455045050@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Noah Misch <noah(at)leadboat(dot)com> writes:
>> On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
>>> We've seen variants
>>> on this theme on half a dozen machines just in the past week --- and it
>>> seems to mostly happen in 9.5 and HEAD, which is fishy.

>> It has been affecting only the four AIX animals, which do share hardware.
>> (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)

> Certainly your AIX critters have shown this a bunch, but here's another
> current example:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23

While we've not seen an actual test failure since I installed the tracing
code, we do have reports from the four AIX critters, and they are pretty
instructive. See for example
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-09%2016%3A27%3A05&stg=stopdb-C-2
which is

Snapshot: 2016-02-09 16:27:05

waiting for server to shut down at 2016-02-09 18:04:09.159 UTC................................................................................ done
server stopped
=========== db log file ==========
2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:2] LOG: received fast shutdown request at 2016-02-09 18:04:09.159 UTC
2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:3] LOG: aborting any active transactions
2016-02-09 18:04:09.160 UTC [56ba2944.14501d8:2] LOG: autovacuum launcher shutting down at 2016-02-09 18:04:09.160 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:40] LOG: shutting down at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:41] LOG: CheckPointGuts starting at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:42] LOG: CheckPointCLOG() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:43] LOG: CheckPointCommitTs() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:44] LOG: CheckPointSUBTRANS() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:45] LOG: CheckPointMultiXact() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:46] LOG: CheckPointPredicate() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:47] LOG: CheckPointRelationMap() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:48] LOG: CheckPointReplicationSlots() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:49] LOG: CheckPointSnapBuild() done at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:50] LOG: CheckPointLogicalRewriteHeap() done at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:51] LOG: BufferSync(5) beginning to write 632 buffers at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:52] LOG: BufferSync(5) done, wrote 632/632 buffers at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:53] LOG: CheckPointBuffers() done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:54] LOG: CheckPointReplicationOrigin() done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:55] LOG: CheckPointGuts done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.967 UTC [56ba2944.363000e:56] LOG: checkpoint WAL record flushed at 2016-02-09 18:04:09.967 UTC
2016-02-09 18:04:09.967 UTC [56ba2944.363000e:57] LOG: pg_control updated at 2016-02-09 18:04:09.967 UTC
2016-02-09 18:04:29.375 UTC [56ba2944.363000e:58] LOG: in mdpostckpt, 3072 unlinks remain to do at 2016-02-09 18:04:29.375 UTC
2016-02-09 18:04:48.207 UTC [56ba2944.363000e:59] LOG: in mdpostckpt, 2048 unlinks remain to do at 2016-02-09 18:04:48.207 UTC
2016-02-09 18:05:07.381 UTC [56ba2944.363000e:60] LOG: in mdpostckpt, 1024 unlinks remain to do at 2016-02-09 18:05:07.381 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:61] LOG: smgrpostckpt() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:62] LOG: RemoveOldXlogFiles() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:63] LOG: TruncateSUBTRANS() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:64] LOG: shutdown checkpoint complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:65] LOG: ShutdownCLOG() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:66] LOG: ShutdownCommitTs() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:67] LOG: ShutdownSUBTRANS() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:68] LOG: database system is shut down at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:26.165 UTC [56ba2944.ee01bc:4] LOG: lock files all released at 2016-02-09 18:05:26.165 UTC

This says that the bulk of the shutdown time is being spent in mdpostckpt,
where we are finally unlinking files belonging to relations that were
dropped since the last checkpoint. That took 75 seconds here, while the
parts before and after it took less than a second each. There were
somewhere between 3000 and 4000 files to be unlinked, so it's not
astonishing that it took some time, but our other critters seem to get
through this a lot quicker. (All four AIX critters reported pretty
similar results, BTW.)
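
For anyone wondering what mdpostckpt spends that time on: it drains the
queue of file unlinks that were postponed until the checkpoint, doing one
synchronous unlink() per dropped relation file. In outline it behaves
about like this (a simplified standalone sketch, not the actual code; the
real function in src/backend/storage/smgr/md.c walks the pendingUnlinks
list, and the progress message comes from my tracing patch):

/*
 * Simplified sketch of the post-checkpoint unlink work: drain a queue
 * of files whose relations were dropped since the last checkpoint,
 * logging progress as we go.  Illustrative only.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct PendingUnlink
{
	struct PendingUnlink *next;
	char	   *path;			/* file to remove */
} PendingUnlink;

static void
post_checkpoint_unlinks(PendingUnlink *queue, int total)
{
	int			remaining = total;

	while (queue != NULL)
	{
		PendingUnlink *entry = queue;

		queue = entry->next;

		/*
		 * Each of these is a synchronous filesystem metadata operation;
		 * on a slow filesystem (NFS, say) this is where all the
		 * shutdown time goes.
		 */
		if (unlink(entry->path) < 0 && errno != ENOENT)
			fprintf(stderr, "could not remove \"%s\": %s\n",
					entry->path, strerror(errno));

		free(entry->path);
		free(entry);

		/* progress trace, like the "N unlinks remain" messages above */
		if (--remaining % 1024 == 0 && remaining > 0)
			fprintf(stderr, "in mdpostckpt, %d unlinks remain to do\n",
					remaining);
	}
}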

So my diagnosis at this point is

(1) Slow file system, specifically slow unlink, is the core of the
problem. (I wonder if the AIX critters are using an NFS filesystem?
There's a quick way to check raw unlink speed sketched below.)

(2) The apparent increase in failure rate in more recent branches is
probably real, but the likely cause is simply more test cases = more
stuff to be cleaned up at the end = mdpostckpt takes longer.
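
If an owner of one of these critters wants to confirm that raw unlink
speed is the bottleneck, a throwaway test program along these lines
(hypothetical, not anything in our tree), compiled and run on the
filesystem that holds the data directory, would settle it:

/*
 * Quick-and-dirty unlink microbenchmark: create NFILES empty files in
 * the current directory, then time how long it takes to remove them.
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NFILES 3000

int
main(void)
{
	char		name[64];
	struct timespec start,
				end;
	int			fd;
	int			i;

	for (i = 0; i < NFILES; i++)
	{
		snprintf(name, sizeof(name), "unlink_test_%d", i);
		if ((fd = open(name, O_CREAT | O_WRONLY, 0600)) < 0)
		{
			perror("open");
			return 1;
		}
		close(fd);
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < NFILES; i++)
	{
		snprintf(name, sizeof(name), "unlink_test_%d", i);
		if (unlink(name) < 0)
			perror("unlink");
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("unlinked %d files in %.3f seconds\n", NFILES,
		   (end.tv_sec - start.tv_sec) +
		   (end.tv_nsec - start.tv_nsec) / 1e9);
	return 0;
}

On a local disk that ought to complete in a small fraction of a second;
tens of seconds would be consistent with what hornet is showing.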

I'm not sure whether there's anything to be gained by leaving the tracing
code in there till we see actual buildfarm fails. There might be another
slowdown mechanism somewhere, but I rather doubt it. Thoughts?

In any case, we should proceed with fixing things so that buildfarm owners
can specify a higher shutdown timeout for especially slow critters.
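
For reference, pg_ctl already has the necessary knob: its -t/--timeout
switch controls how long a waited "pg_ctl stop" hangs around before
reporting failure (the default is 60 seconds). So presumably all that's
needed on the client side is a buildfarm configuration setting that gets
passed through as, say,

	pg_ctl stop -m fast -t 300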

regards, tom lane
