Re: Better detection of staled postmaster.pid

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Pavel Raiskup <praiskup(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Better detection of staled postmaster.pid
Date: 2015-08-31 14:34:24
Message-ID: CAKFQuwZyzi+R2BYC2WvX7i2fkNfjtCr1qqFK_0-U+HMDezEqKw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 31, 2015 at 10:20 AM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:

> Pavel Raiskup <praiskup(at)redhat(dot)com> wrote:
>
> > It's been reported [1] that postmaster fails to start against staled
> > postmaster.pid after (e.g.) power outage on Fedora; it's due to init
> > system parallelism, and "some" other newly started process can already
> > have allocated the same PID as the old postmaster had -- and in this
> > case postmaster refuses to delete staled pidfile (which is expected as
> > we need to be really careful).
> >
> > Don't you see some other possible check we could implement to guarantee
> > that the PID mentioned in postmaster.pid does not hide concurrent
> > postmaster?
>

Most of this can be gleaned from the linked bug report...

> Was the other newly started process another PostgreSQL cluster?
>

Yes

> Was it launched under the same OS user? (Those are the only
> conditions under which I've seen this.) I think it is wise to use
> a separate OS user for each cluster.
>

Yes. Does the PID check verify that the owner of the pid file matches the
owner of the process? While that seems like good advice, I'm not sure how it
would prevent this scenario, though that is likely due to a lack of knowledge
on my part.

>
> If it's not a matter of multiple clusters running under the same OS
> user, please provide more details, like the specific version and
> copy/paste of error messages and relevant log entries.
>

See the report. I understand not wanting transient data linked from these
kinds of postings, but the supplied description and the official downstream
project bug report seem like sufficient data to work from, even if only in a
preliminary fashion.

The only obvious solution is to stop using (pid) as a primary key of sorts
and use (pid, timecreated) instead. After a restart/reboot the timecreated
would be guaranteed to have changed and no guessing would be involved.
That seems invasive, though proper, for a problem largely limited to an
uncommon distribution-specific setup that requires an unclean shutdown to
occur.
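
To make the (pid, timecreated) idea concrete, here is a rough sketch in
Python rather than in the postmaster's C. It is not how PostgreSQL works
today. The names pidfile_is_stale and linux_start_ticks are hypothetical,
as is the assumption that the pidfile would record a start time next to the
PID. The Linux helper reads the "starttime" field of /proc/<pid>/stat, which
counts clock ticks since boot, so a real check would probably want to
combine it with something boot-independent.

```python
from typing import Callable, Optional, Tuple

# Hypothetical: a pidfile entry that stores (pid, start_time), not pid alone.
ProcessKey = Tuple[int, int]
StartTimeLookup = Callable[[int], Optional[int]]


def linux_start_ticks(pid: int) -> Optional[int]:
    """Return the start time of `pid` in clock ticks since boot, or None.

    Linux-only assumption: reads field 22 (starttime) of /proc/<pid>/stat.
    """
    try:
        with open(f"/proc/{pid}/stat", "rb") as f:
            data = f.read().decode("ascii", "replace")
    except (FileNotFoundError, ProcessLookupError):
        return None  # no such process
    # Field 2 (comm) is parenthesised and may contain spaces, so split
    # only the part after the last ')'. That part starts at field 3.
    rest = data[data.rindex(")") + 2:].split()
    return int(rest[19])  # field 22 is index 19 when counting from field 3


def pidfile_is_stale(
    recorded: ProcessKey,
    lookup: StartTimeLookup = linux_start_ticks,
) -> bool:
    """Decide whether a recorded (pid, start_time) pair is stale.

    Stale means: no process has that PID any more, or the process that
    now holds the PID started at a different time than the recorded one.
    """
    pid, recorded_start = recorded
    current_start = lookup(pid)
    if current_start is None:
        return True
    return current_start != recorded_start
```

With a PID-only key, the third case below (PID reused by a process that
started at another time) is indistinguishable from a live postmaster; with
the pair it is reported as stale: pidfile_is_stale((1234, 100), lambda pid: 250).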

David J.

