Re: auto removing stale pid for postmaster NT service

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Sullivan <andrew(at)libertyrms(dot)info>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: auto removing stale pid for postmaster NT service
Date: 2002-09-16 21:27:38
Message-ID: 2208.1032211658@sss.pgh.pa.us
Lists: pgsql-admin

Andrew Sullivan <andrew(at)libertyrms(dot)info> writes:
> This is because there is a process with the same pid as the
> postmaster. This will happen in cases where the machine crashes and
> starts up again -- something else happens to get the (former)
> postgres pid at startup, and so when postgres checks for a process
> with that pid, one exists. And kerplooey.

FYI, sendmail has the same restart failure mode; I imagine a lot of
other Unix daemons do too.

> I seem to recall that someone (maybe Tom Lane?) suggested an
> extension to the current pidfile check, so that it will also check to
> see if the process really is PostgreSQL. But I don't know if it was
> implemented.

It wasn't yet, mainly because it's not obvious how to tell reliably
whether some other process is a postmaster or not.

I think I had suggested distinguishing EPERM from other kill() errors,
which would tell us whether the other process is under the same userid
as us or not; if not, we could perhaps safely assume that it's not a
postmaster (or at least not one likely to be using our data directory).
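
As a rough illustration of that EPERM distinction (a sketch only, not code from the postmaster; the name probe_pid and its return labels are made up for this example), the probe could look something like this in Python. It relies on kill() with signal 0, which performs the existence and permission checks without delivering a signal:

```python
import errno
import os


def probe_pid(pid):
    """Classify a PID with kill(pid, 0), which delivers no signal.

    Hypothetical helper. Returns one of:
      "signalable"  - a process exists and we may signal it (same
                      userid as us, or we are running as root)
      "other-user"  - a process exists but kill() failed with EPERM
      "no-process"  - kill() failed with ESRCH, nothing has that PID
    """
    try:
        os.kill(pid, 0)
    except OSError as exc:
        if exc.errno == errno.EPERM:
            return "other-user"
        if exc.errno == errno.ESRCH:
            return "no-process"
        raise  # any other errno is unexpected; let the caller see it
    return "signalable"


if __name__ == "__main__":
    # Our own PID always exists and is ours to signal.
    print(probe_pid(os.getpid()))
```

Only the "other-user" answer lets us rule out a postmaster with any confidence. "signalable" still says nothing about what the process actually is.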

Unfortunately, that doesn't really improve the odds very much. The
typical scenario for this problem is that the PID we get assigned will
wobble around by one or two counts from one boot cycle to the next,
depending on just how fast other startup processes manage to finish.
(If we get the exact same PID as before, there's no problem; the code
is smart enough to notice that case.) But the PID(s) adjacent to the
postmaster's will likely also belong to the postgres user --- consider
the shell that launched us, for example. The shell, or whatever it
might launch right after the postmaster, would look enough like a
postmaster to fool this simplistic test.

So I'm at a loss how the postmaster can improve the reliability of this
check, without throwing the baby out with the bathwater by making a
check that might fail to recognize a conflicting postmaster. The
consequences of that would be *dire*.

The best solution is probably to forcibly unlink the postmaster.pid
file in some startup script --- but it has to be a script that is *only*
run during boot, never anytime later. The postgres start script is
not the place for this.
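
A minimal sketch of such a boot-time cleanup step, assuming it is wired into something that runs exactly once per boot and before the postmaster is started. The function name and the data directory path shown here are hypothetical; only the file name postmaster.pid comes from the discussion above:

```python
import os


def remove_stale_pidfile(data_dir):
    """Unlink data_dir/postmaster.pid if it exists.

    Hypothetical helper, safe ONLY when called from a boot-only hook:
    at that point no postmaster can be running yet, so any pid file
    left in the data directory must be stale.
    Returns True if a file was removed, False if there was none.
    """
    pid_path = os.path.join(data_dir, "postmaster.pid")
    try:
        os.unlink(pid_path)
    except FileNotFoundError:
        return False
    return True


if __name__ == "__main__":
    # Assumed data directory; substitute the real PGDATA location.
    remove_stale_pidfile("/usr/local/pgsql/data")
```

The sketch deliberately does no liveness check at all. The safety comes entirely from when it runs, which is why it does not belong in the regular start script.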

regards, tom lane
