Re: How to implement autostart of postgres?

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: "Gavrina, Irina" <igavrina(at)mera(dot)ru>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: How to implement autostart of postgres?
Date: 2011-07-29 04:38:23
Message-ID: 4E32393F.70401@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 28/07/11 18:06, Gavrina, Irina wrote:

> Yes, you are right, this is a question of database robustness.

Honestly, this is the wrong approach to high availability. Don't try to
make one machine unbreakable - you *cannot* do it. Use multi-machine
failover between a master and a replica with STONITH and a
heartbeat/monitoring setup. See Google for more info.

> You know,
> If any of postgres processes died the postmaster process will again
> start up it. The main worry is if the postmaster process dies (truth be
> told, I can reproduce it only using ‘kill –KILL’, which should be never
> done, I know) and all other processes are up and running. All created
> connections will go on with their work. And it means all db queries will
> be processed after the postmaster died. Am I correct?
>
> Is there any risk of losing of data or data corruption in this case?

Possibly, if you then remove postmaster.pid, thus bypassing the other
safety, and start a new postmaster.

I'm a bit alarmed to discover that Ubuntu's postgresql packaging appears
to do exactly this.

/usr/bin/pg_ctlcluster, from line 78:

# If a pid file is already present, delete it if it is stale/invalid,
# or exit with a notice if it belongs to an already running postmaster.
sub start_check_pid_file {
my $pidfile = $info{'pgdata'}.'/postmaster.pid';
if (check_running_postmaster $pidfile) {
print "Cluster is already running.\n";
exit 2;
}

# Remove invalid or stale PID file
if (-e $pidfile) {
unlink $pidfile;
print "Removed stale pid file.\n";
}
}

It merrily removes the pid file without any effort to check if there are
backends using that datadir, then goes on to relaunch the postmaster.

Is this sane? Safe?

I realize that they want to have Pg come back up properly after an OS
crash or hard power loss, where it doesn't get the chance to remove the
pid file. Fine. It also opens the system up to user "kill -9" abuse,
though...

> There is any way to configure the postgres db like ‘if the postmaster
> process died all child processes are stopped’?

The whole issue with kill -9 is that you're telling the system to bypass
*ALL* of the usual safeties and recover opportunities, including any
crash handler the program may have in place. You're denying the
postmaster any opportunity to warn the child processes that it is
crashing out.

"If I shoot my foot off with this shotgun, is there any way I can
automatically stop the bleeding and graft a new foot on?"

No. Don't shoot your foot off just because UNIX systems give you the
ability to do just that if you feel like it. You don't "dd if=/dev/zero
of=/dev/sdzz1" either... (target node changed in case any idiot tries to
run it anyway).

That said - a quick test shows that the postmaster dies without
signalling its children when killed with a SIGABRT (often raised by
glibc memory function errors, assertions, etc) too. A query sitting in a
big cartesian join merrily continues processing. Whether that applies to
queries that're doing write activity, I'm not sure.

Anyway, something like this only-very-briefly-tested bash code should
give you vaguely the right idea, though I *STRONGLY* recommend you use
external monitoring like nagios or the like, plus a proper failover
setup, instead of a hackish self-restart facility that's likely to make
things WORSE not better.

function capture_cash_info() {
# Capture `dmesg' output, the last 1000 lines of the
# postmaster logs, etc etc. You'll have to write this.
# The function should email a report to the admin for urgent
# examination.
echo "Capture crash report"
}

function scream_for_help() {
# Something unrecoverable happened. Email the admin, TXT them,
# activate their house alarm, and electrocute them awake, do
# whatever you usually have to do.
# You'll have to write this.
echo 'Help me! ARgh!'
}

function try_to_recover() {
# If you want to know it, $1 will contain "stopped" if pidfile
# missing, or "died" if pidfile present but stale. Can be used in
# admin email reports etc.
if test $(pgrep postgres|wc -l) -gt 0 ; then
echo 1>&2 "Postmaster dead but children still running!";
# Email/TXT the admin screaming for help and saying some idiot
# "kill -9"'d the postmaster
# If you felt brave you could SIGTERM the children here then
# next time the watchdog ran might be able to relaunch the
# postmaster if they exited. I wouldn't want to, though.
scream_for_help
else
capture_cash_info
# relaunch pg using your distro's init script or using pg_ctl,
# depending on how you usually do it
false || scream_for_help
fi
}

DATADIR=/var/lib/postgresql/8.4/main/
PID=$DATADIR/postmaster.pid
if ! test -e "$PID"
then
try_to_recover stopped
elif ! kill -0 `cat "$PID"` >&/dev/null; then
try_to_recover died
fi

If you use the above and it eats your data then eats your cat, don't
blame me. I told you not to use it.

--
Craig Ringer

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Maton, Brett 2011-07-29 07:57:49 Re: Fresh Restore - relation contains more than "max_fsm_pages"
Previous Message Craig Ringer 2011-07-29 03:56:04 Re: Fresh Restore - relation contains more than "max_fsm_pages"