Re: occasional startup failures

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: occasional startup failures
Date: 2012-03-25 16:59:03
Message-ID: 16777.1332694743@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Every so often buildfarm animals (nightjar and raven recently, for
> example) report failures on starting up the postmaster. It appears that
> these failures are due to the postmaster not creating the pid file
> within 5 seconds, and so the logic in commit
> 0bae3bc9be4a025df089f0a0c2f547fa538a97bc kicks in. Unfortunately, when
> this happens the postmaster has in fact sometimes started up, and the
> end result is that subsequent buildfarm runs will fail when they detect
> that there is already a postmaster listening on the port, and without
> manual intervention to kill the "rogue" postmaster this continues endlessly.

> I can probably add some logic to the buildfarm script to try to detect
> this condition and kill an errant postmaster so subsequent runs don't
> get affected, but that seems to be avoiding a problem rather than fixing
> it. I'm not sure what we can do to improve it otherwise, though.

Yeah, this has been discussed before. IMO the only real fix is to
arrange things so that the postmaster process is an immediate child of
pg_ctl, allowing pg_ctl to know its PID directly and not have to rely
on the pidfile appearing before it can detect whether the postmaster
is still alive. Then there is no need for a guesstimated timeout.
That means not using system() anymore, but rather fork/exec, which
mainly implies having to write our own code for stdio redirection.
So that's certainly doable if a bit tedious. I have no idea about
the Windows side of it though.

regards, tom lane
