Re: [COMMITTERS] pgsql: Improve corner cases in pg_ctl's new wait-for-postmaster-startup

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Improve corner cases in pg_ctl's new wait-for-postmaster-startup
Date: 2011-06-01 14:51:57
Message-ID: 18441.1306939917@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> On Wed, Jun 1, 2011 at 12:30 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> To address this corner
>>> case, we should check whether postmaster is really running by sending
>>> the signal 0 after we read the postmaster.pid file? Attached patch does that.

>> I find myself unimpressed by this approach, because it supposes that the
>> postmaster got as far as creating postmaster.pid.

> Sorry, I could not understand the reason why you were unimpressed.
> Could you explain it in a little more detail?

[ thinks some more... ] Actually, there's more merit to your suggestion
than I saw at first, but it's still got an issue. We can divide
postmaster failures into four cases:
  1. postmaster fails before creating postmaster.pid,
     and there was no pre-existing postmaster.pid file
  2. postmaster fails before creating postmaster.pid,
     but there was a pre-existing postmaster.pid file
  3. postmaster fails after creating postmaster.pid,
     and successfully removes postmaster.pid
  4. postmaster fails after creating postmaster.pid,
     and fails to remove postmaster.pid
The current HEAD code will detect cases 1 and 3 (after 5 seconds), and will
detect case 2 by virtue of noticing a stale timestamp in the old
pidfile; but it will wait till timeout in case 4. If we add your
suggestion to what's there now, it will cover case 4. It doesn't cover
case 1, and might not cover case 3 (if the pidfile was there for so
short a time that we never saw it) but that really isn't a problem
because the existing timeout logic handles those cases.
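
For illustration, the "signal 0" probe suggested above can be sketched as
follows. This is a minimal sketch in Python rather than pg_ctl's C, and it
assumes a POSIX system, where kill() with signal 0 performs only the
existence and permission checks and delivers nothing. The function name
process_exists is hypothetical and does not come from the patch.

```python
import os


def process_exists(pid: int) -> bool:
    """Return True if a process with this PID appears to exist (POSIX only)."""
    if pid <= 0:
        # Zero and negative PIDs address process groups, not one process.
        raise ValueError("pid must be positive")
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is sent
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True  # EPERM: the process exists but belongs to another user
    return True
```

The probe only says that some process owns that PID. It cannot tell whether
that process is the postmaster, which is why a PID read from a stale pidfile
is a problem.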

The problem I've got with the proposed change is that it's brittle
against case 2: it might pick up a PID from a stale pidfile and then
conclude that the postmaster died, when actually the postmaster hasn't
yet written a new pidfile. However, the existing code is also brittle
in this case, because when it sees that the pidfile is stale, it
immediately fails.

I think we can make it better by simply ignoring a pidfile with a stale
timestamp (hoping for it to be overwritten), and remembering the PID to
try kill(pid, 0) on from the first time we successfully parse the file.
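
The idea in the paragraph above might look roughly like the loop below. It
is a sketch under stated assumptions, not the code that went into pg_ctl:
the callables read_pidfile, process_exists and is_ready are hypothetical
stand-ins for parsing the pidfile, the kill(pid, 0) probe, and whatever
readiness check the caller uses, and the separate short timeout for a
pidfile that never appears (cases 1 and 3) is left out.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical shape of a parsed pidfile: (pid, start timestamp).
PidInfo = Tuple[int, float]


def wait_for_startup(
    read_pidfile: Callable[[], Optional[PidInfo]],
    process_exists: Callable[[int], bool],
    is_ready: Callable[[], bool],
    launch_time: float,
    timeout: float = 60.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the server is ready, has died, or the timeout expires."""
    remembered_pid: Optional[int] = None
    waited = 0.0
    while waited < timeout:
        info = read_pidfile()  # None if the file is missing or unparsable
        if info is not None and remembered_pid is None:
            pid, started = info
            # A timestamp older than our launch means a stale pidfile:
            # ignore it and hope the new postmaster overwrites it.
            if started >= launch_time:
                remembered_pid = pid
        if remembered_pid is not None:
            if not process_exists(remembered_pid):
                return "failed"  # case 4: pidfile left behind, process gone
            if is_ready():
                return "ready"
        sleep(poll_interval)
        waited += poll_interval
    return "timeout"
```

With this shape, a stale pidfile (case 2) neither triggers an immediate
failure nor supplies a PID to probe; the loop simply keeps waiting until a
fresh pidfile shows up or the timeout is reached.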

regards, tom lane
