Re: Tracing down buildfarm "postmaster does not shut down" failures

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Tracing down buildfarm "postmaster does not shut down" failures
Date: 2016-02-09 22:53:59
Message-ID: 24722.1455058439@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> I'm not sure whether there's anything to be gained by leaving the tracing
> code in there till we see actual buildfarm fails. There might be another
> slowdown mechanism somewhere, but I rather doubt it. Thoughts?

Hmmm ... I take that back. AFAICT, the failures on Noah's AIX zoo are
sufficiently explained by the "mdpostckpt takes a long time after the
regression tests" theory. However, there is something else happening
on axolotl. Looking at the HEAD and 9.5 branches, there are three very
similar failures in the ECPG step within the past 60 days:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-15%2018%3A49%3A31
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-12%2001%3A44%3A39

In all three, we got "pg_ctl: server does not shut down", but the
postmaster log claims that it shut down, and pretty speedily too.
For example, in the 2015-12-12 failure,

LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: checkpoint starting: shutdown immediate
LOG: checkpoint complete: wrote 176 buffers (1.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.039 s, sync=0.000 s, total=0.059 s; sync files=0, longest=0.000 s, average=0.000 s; distance=978 kB, estimate=978 kB
LOG: database system is shut down

We have no theory that would account for postmaster shutdown stalling
after the end of ShutdownXLOG, but that seems to be what happened.
How come? Why does only the ECPG test seem to be affected?
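
To make "pretty speedily" concrete, here is a minimal sketch that pulls the timing fields out of the "checkpoint complete" line quoted above. The function name and the regular expression are illustrative assumptions, not part of PostgreSQL or the buildfarm client; the only input is the log line as it appears in this message.

```python
import re

# "checkpoint complete" line copied verbatim from the 2015-12-12 log above.
LINE = (
    "LOG: checkpoint complete: wrote 176 buffers (1.1%); "
    "0 transaction log file(s) added, 0 removed, 0 recycled; "
    "write=0.039 s, sync=0.000 s, total=0.059 s; "
    "sync files=0, longest=0.000 s, average=0.000 s; "
    "distance=978 kB, estimate=978 kB"
)


def checkpoint_timings(line):
    """Return write/sync/total durations in seconds, or None.

    Hypothetical helper: it only understands the field layout seen in
    the log line above.
    """
    if "checkpoint complete" not in line:
        return None
    # "sync files=0" is skipped because "=" must directly follow the key.
    found = re.findall(r"\b(write|sync|total)=([0-9.]+) s", line)
    return {key: float(value) for key, value in found}


timings = checkpoint_timings(LINE)
```

For this line the shutdown checkpoint accounts for about 0.06 s in total, which is nowhere near long enough to explain pg_ctl giving up.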

It's also pretty fishy that we have three failures in 60 days on HEAD+9.5
but none before that, and none in the older branches. That smells like
a recently-introduced bug, though I have no idea what.

Andrew, I wonder if I could prevail on you to make axolotl run "make
check" on HEAD in src/interfaces/ecpg/ until it fails, so that we can
see if the logging I added tells anything useful about this.
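
One way to do the "until it fails" part is a small driver loop. This is only a sketch under stated assumptions: the helper name, the run cap, and the choice to capture output are mine, and it presumes it is started from the top of an already-configured source tree; it is not how the buildfarm client itself runs tests.

```python
import subprocess


def run_until_failure(cmd, cwd=None, max_runs=1000):
    """Run cmd repeatedly and stop at the first nonzero exit status.

    Returns (run_number, CompletedProcess) for the failing run, or None
    if all max_runs runs succeed. Hypothetical helper, for illustration.
    """
    for run_number in range(1, max_runs + 1):
        result = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep stdout and stderr interleaved
            text=True,
        )
        if result.returncode != 0:
            return run_number, result
    return None


# Intended use (assumed paths, run from the top of the source tree):
#   outcome = run_until_failure(["make", "check"], cwd="src/interfaces/ecpg")
#   if outcome is not None:
#       print("failed on run", outcome[0])
#       print(outcome[1].stdout)
```

The loop stops right after the first failing run and starts nothing further, so whatever that run left on disk, such as the postmaster log with the extra tracing, is not overwritten by a later run.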

regards, tom lane
