Unreliable "pg_ctl -w start" again

From: "MauMau" <maumau307(at)gmail(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Unreliable "pg_ctl -w start" again
Date: 2012-01-27 15:45:19
Message-ID: 996B1BE9112D45A48F5497E68D90BFFE@maumau
Lists: pgsql-hackers

Hello,

Last year, I asked for your opinions on how to fix the unreliable
"pg_ctl -w start" bug, in this thread:

http://archives.postgresql.org/pgsql-hackers/2011-05/msg01407.php

The symptom was that "pg_ctl -w start" did not return for 60 seconds when
postgresql.conf contained an invalid parameter setting.

Recently, I've encountered another problem with "pg_ctl -w start", which I
cannot reliably avoid. I found the cause in pg_ctl.c. I'm willing to create
a patch, but I'm concerned about the correctness of the fix. I would like
this bug to be fixed as soon as possible, so I'd like to ask for your
opinions.

[Problem]
I use PostgreSQL 8.3.12 embedded in a packaged application. The problem
occurred on RHEL5 when the operating system was starting up. The packaged
application is started from /etc/init.d/myapp. That application internally
executes "pg_ctl -w start" and checks its return value. The application does
not start unless the return value is 0.

The problem is that "pg_ctl -w start" fails with return value 1 after only
two seconds, without waiting for the 60-second timeout. That is, -w did not
work. However, the database server did start successfully.

The timeline was as follows:

18:09:45 the application executed "pg_ctl -w start"
18:09:47 "pg_ctl -w start" returned with 1

<PostgreSQL's server log (dates are intentionally omitted)>
18:10:01 JST 22995 LOG: database system was interrupted; last known up at 2012-01-21 02:24:59 JST
18:10:32 JST 22995 LOG: database system was not properly shut down; automatic recovery in progress
18:10:34 JST 22995 LOG: record with zero length at 0/23E35D4
18:10:34 JST 22995 LOG: redo is not required
18:11:38 JST 22893 LOG: database system is ready to accept connections
18:11:38 JST 23478 LOG: autovacuum launcher started

PostgreSQL took a long time to start. This is probably because the system
load was high with many processes booting up concurrently during OS boot.

[Cause]
The following part in do_start() of pg_ctl.c contains a bug:

    if (old_pid != 0)
    {
        pg_usleep(1000000);
        pid = get_pgpid();
        if (pid == old_pid)
        {
            write_stderr(_("%s: could not start server\n"
                           "Examine the log output.\n"),
                         progname);
            exit(1);
        }
    }

This part assumes that postmaster will overwrite postmaster.pid within one
second. This assumption does not hold under heavy load such as OS startup.
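
To make the check concrete: it relies on reading the PID recorded in
postmaster.pid before and after launching the server. Below is a minimal
sketch of such a read. read_pid_file is a hypothetical name used only for
illustration (pg_ctl.c uses get_pgpid() for this), and the sketch assumes
the PID is on the first line of the file.

```c
#include <stdio.h>

/*
 * Hypothetical helper, for illustration only (not the pg_ctl.c code).
 * Returns the PID stored on the first line of the given pid file,
 * or 0 if the file does not exist or does not start with a number.
 */
static long
read_pid_file(const char *path)
{
    FILE *fp = fopen(path, "r");
    long  pid = 0;

    if (fp == NULL)
        return 0;               /* no pid file: nothing recorded */

    if (fscanf(fp, "%ld", &pid) != 1)
        pid = 0;                /* unparsable contents */

    fclose(fp);
    return pid;
}
```

If the value read after the one-second sleep equals the value read before,
that alone does not tell whether the new postmaster failed or simply has
not rewritten the file yet.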

In PostgreSQL 9.1, the wait processing was substantially reworked. However,
the same assumption still seems to remain, though the duration is now 5
seconds. A 5-second wait is probably insufficient in my case. I think no
fixed duration is appropriate.

[Solution]
So, what is the reliable solution? The pipe-based one, which I proposed in
the past thread, would be reliable. However, that is not simple enough to
back-port to 8.3.

How about inserting postmaster_is_alive() as below? I know this is not
perfect, but it will work in most cases. I need a solution that helps in
practice.

    if (old_pid != 0)
    {
        pg_usleep(1000000);
        pid = get_pgpid();
        if (pid == old_pid && postmaster_is_alive(pid))
        {
            write_stderr(_("%s: could not start server\n"
                           "Examine the log output.\n"),
                         progname);
            exit(1);
        }
    }
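
The intent of the extra condition can be sketched as follows. This is only
an illustration under assumptions: process_is_alive and
start_clearly_failed are hypothetical names, and probing with kill(pid, 0)
is one common POSIX way to test whether a process exists, which may differ
in detail from what postmaster_is_alive() does in a given release.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a liveness check. Signal 0 delivers nothing;
 * kill() only reports whether the target exists and may be signalled.
 * In this sketch any error (including EPERM) is treated as "not alive".
 */
static bool
process_is_alive(pid_t pid)
{
    if (pid <= 0)
        return false;
    return kill(pid, 0) == 0;
}

/*
 * Hypothetical decision helper. Report failure only when the pid file
 * still names the old PID AND that process is really running. If the old
 * PID is dead, the file is a stale leftover that the new postmaster has
 * not overwritten yet, so the caller should keep waiting.
 */
static bool
start_clearly_failed(pid_t old_pid, pid_t cur_pid)
{
    return old_pid != 0
        && cur_pid == old_pid
        && process_is_alive(cur_pid);
}
```

With this shape, a stale postmaster.pid left behind by a crash no longer
causes an immediate exit(1); the normal -w wait decides the outcome
instead.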

Regards
MauMau
