Re: Tracing down buildfarm "postmaster does not shut down" failures

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Tracing down buildfarm "postmaster does not shut down" failures
Date: 2016-02-10 00:24:18
Message-ID: 29564.1455063858@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Incidentally, as I noted earlier, the ecpg tests don't honour
> TEMP_CONFIG, and in axolotl's case this could well make a difference, as
> it it set up like this:
> ...
> So running it's not running with fsync off or using the ramdisk for
> stats_temp_directory.

Oooohh ...

I had just been looking into what else the postmaster does after "database
system is shut down" comes out. One of those activities is telling the
stats collector to shut down, and then waiting for it to do so. So a
really slow write of the collected stats might possibly account for
delaying things here.

What I was doing was adding some more logging in the
post-shutdown-checkpoint area, and this is what I see on dromedary,
when shutting down just after a regression test run:

LOG: database system is shut down at 2016-02-09 19:12:29.497 EST
LOG: doing before_shmem_exit 0 at 2016-02-09 19:12:29.498 EST
LOG: doing on_shmem_exit 2 at 2016-02-09 19:12:29.498 EST
LOG: doing on_shmem_exit 1 at 2016-02-09 19:12:29.498 EST
LOG: doing on_shmem_exit 0 at 2016-02-09 19:12:29.498 EST
LOG: doing on_proc_exit 1 at 2016-02-09 19:12:29.498 EST
LOG: doing on_proc_exit 0 at 2016-02-09 19:12:29.498 EST
LOG: calling exit(0) at 2016-02-09 19:12:29.498 EST
LOG: checkpointer dead at 2016-02-09 19:12:32.382 EST
LOG: all children dead at 2016-02-09 19:12:32.387 EST
LOG: doing on_shmem_exit 3 at 2016-02-09 19:12:32.387 EST
LOG: doing on_shmem_exit 2 at 2016-02-09 19:12:32.387 EST
LOG: doing on_shmem_exit 1 at 2016-02-09 19:12:32.387 EST
LOG: doing on_shmem_exit 0 at 2016-02-09 19:12:32.396 EST
LOG: doing on_proc_exit 1 at 2016-02-09 19:12:32.396 EST
LOG: doing on_proc_exit 0 at 2016-02-09 19:12:32.396 EST
LOG: lock files all released at 2016-02-09 19:12:32.397 EST
LOG: calling exit(0) at 2016-02-09 19:12:32.397 EST

The first batch of those "on_shmem/proc_exit" reports are from the
checkpointer process, and the second set are from the postmaster.
The stats collector shutdown would be between "checkpointer dead"
and "all children dead". We can see that on this machine, that's
not really an issue ... but how in the world did it take the postmaster
nigh three seconds to notice the checkpoint process exit? Something
very odd indeed there.

Anyway, I think I should push this additional instrumentation so you
can use it on axolotl.

This also makes it look like we really need to revisit where/when the
"database system is shut down" message is printed. But that's for
later.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-02-10 00:37:22 Re: Tracing down buildfarm "postmaster does not shut down" failures
Previous Message Michael Paquier 2016-02-10 00:18:22 Re: Support for N synchronous standby servers - take 2