Re: Unreliable "pg_ctl -w start" again

From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unreliable "pg_ctl -w start" again
Date: 2012-01-28 02:36:18
Message-ID: 4F0633450B7A4741B348040C8831EC90@maumau
Lists: pgsql-hackers

From: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Well, feel free to increase that duration if you want. The reason it's
> there is to not wait for a long time if the postmaster falls over
> instantly at startup, but in a non-interactive situation you might not
> care.

Yes, just lengthening the wait duration causes an unnecessarily long wait when
we run pg_ctl interactively. Therefore, the current wait approach is not
correct.

>> How about inserting postmaster_is_alive() as below?
>
> Looks like complete nonsense to me, if the goal is to behave sanely when
> postmaster.pid hasn't been created yet. Where do you think get_pgpid
> gets the PID from?

Yes, I understand that get_pgpid() gets the pid from postmaster.pid, which
may be the pid of the previous postmaster that did not stop cleanly.

I think my simple fix makes sense to solve the problem as follows. Could you
point out what might not be good?

1. The previous postmaster was terminated abruptly due to OS shutdown,
machine failure, etc., leaving postmaster.pid behind.
2. Run "pg_ctl -w start" to start a new postmaster.
3. do_start() of pg_ctl reads the pid of the previously running postmaster
from postmaster.pid. Call it pid-1 (old_pid in the code).

    old_pid = get_pgpid();

4. Anyway, try to start the postmaster by calling start_postmaster().
5. If postmaster.pid existed at step 3, it means one of:

(a) The previous postmaster did not stop cleanly and left postmaster.pid.
(b) Another postmaster is already running in the data directory (it was
started before this run of pg_ctl -w start).

But we can't distinguish between them. So we read postmaster.pid again to
judge the situation. Call it pid-2 (pid in the code).

    if (old_pid != 0)
    {
        pg_usleep(1000000);
        pid = get_pgpid();

6. If pid-1 != pid-2, it means that situation (a) applies and the newly
started postmaster overwrote the old postmaster.pid. Then, try to connect to
the postmaster.

If pid-1 == pid-2, it means one of:

(a') The previous postmaster did not stop cleanly and left postmaster.pid.
The newly started postmaster will complete startup, but hasn't overwritten
postmaster.pid yet.
(b) Another postmaster is already running in the data directory (it was
started before this run of pg_ctl -w start).

The current comparison logic cannot distinguish between them. In my problem
case, situation (a') happened, and pg_ctl mistakenly exited.

        if (pid == old_pid)
        {
            write_stderr(_("%s: could not start server\n"
                           "Examine the log output.\n"),
                         progname);
            exit(1);
        }

7. To distinguish between (a') and (b), check whether pid-1 is alive. If
pid-1 is alive, it is situation (b). Otherwise, it is situation (a').

        if (pid == old_pid && postmaster_is_alive(old_pid))

However, the pid of the newly started postmaster might match that of the old
postmaster. To deal with that situation, it may be better to also check the
modification timestamp of postmaster.pid.
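
The decision in steps 6 and 7 can be sketched as a small standalone function. This is only an illustration of the proposed check, not the actual pg_ctl code: the enum, classify_start() and process_is_alive() are hypothetical names, and treating EPERM as "alive" is an assumption of this sketch. The caller is assumed to apply it only when old_pid != 0, as in the fragment above.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical outcome names for this sketch; they do not exist in pg_ctl. */
typedef enum
{
    START_NEW_PIDFILE,      /* pid changed: new postmaster rewrote postmaster.pid */
    START_STALE_PIDFILE,    /* same pid, process gone: situation (a'), keep waiting */
    START_ALREADY_RUNNING   /* same pid, process alive: situation (b), give up */
} StartState;

/*
 * Liveness probe. Signal 0 delivers nothing; kill() only performs the
 * existence and permission checks.
 */
static bool
process_is_alive(pid_t pid)
{
    if (pid <= 0)
        return false;
    if (kill(pid, 0) == 0)
        return true;
    /* Assumption: EPERM means the process exists but belongs to another user. */
    return errno == EPERM;
}

/*
 * old_pid: pid read from postmaster.pid before start_postmaster()
 * new_pid: pid read from postmaster.pid after the short sleep
 */
static StartState
classify_start(pid_t old_pid, pid_t new_pid)
{
    if (new_pid != old_pid)
        return START_NEW_PIDFILE;
    return process_is_alive(old_pid) ? START_ALREADY_RUNNING
                                     : START_STALE_PIDFILE;
}
```

Only START_ALREADY_RUNNING would lead to the "could not start server" exit. This sketch does not cover the pid-reuse case mentioned above; that would need the extra timestamp check.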

What do you think?

> If we had the postmaster's PID a priori, we could detect postmaster
> death directly instead of having to make assumptions about how long
> is reasonable to wait for the pidfile to appear. The problem is that
> we don't want to write a complete replacement for the shell's command
> line parser and I/O redirection logic. It doesn't look like a small
> project.

Yes, I understand this. I don't think we can reimplement everything the shell
does.

> (But maybe we could bypass that by doing a fork() and then having
> the child exec() the shell, telling it to exec postmaster in turn?)

Possibly. I hope this works. Then we could pass unnamed pipe file descriptors
to the postmaster via environment variables from pg_ctl's forked child.

> And of course Windows as usual makes things twice as hard, since we
> couldn't make such a change unless start_postmaster could return the
> proper PID in that case too.

Well, we can make start_postmaster() return the pid of the newly created
postmaster. CreateProcess() sets the process handle in the structure passed
to it. We can pass the process handle to WaitForSingleObject() to know
whether the postmaster is alive.

Regards
MauMau
