Re: Shutting down a warm standby database in 8.2beta3

From: Stephen Harris <lists(at)spuddy(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Shutting down a warm standby database in 8.2beta3
Date: 2006-11-22 18:56:23
Message-ID: 20061122185623.GA23202@pugwash.spuddy.org
Lists: pgsql-general pgsql-hackers

On Mon, Nov 20, 2006 at 11:20:41AM -0500, Tom Lane wrote:
>
> kill(child_pid, SIGxxx);
> #ifdef HAVE_SETSID
> kill(-child_pid, SIGxxx);
> #endif
>
> In the normal case where the child has already completed setsid(), the
> extra signal sent to it should do no harm. In the startup race
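A minimal standalone sketch of the pattern quoted above, assuming a POSIX system. `HAVE_SETSID` is PostgreSQL's configure-time macro and is simply defined by hand here. The helper names `signal_child_and_group` and `child_exit_signal` are hypothetical and are not PostgreSQL functions.

```c
#define _POSIX_C_SOURCE 200809L
#define HAVE_SETSID 1 /* assumption: stands in for the configure-time macro */

#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: signal the child itself, then its process group.
 * The group kill reaches grandchildren (e.g. a running restore command)
 * once the child has called setsid(). Returns 0 if either kill() worked,
 * -1 if neither did. */
int signal_child_and_group(pid_t child_pid, int sig)
{
    int delivered = (kill(child_pid, sig) == 0);
#ifdef HAVE_SETSID
    /* Fails with ESRCH if the child has not reached setsid() yet
     * (the startup race); the direct kill above still covers that case. */
    if (kill(-child_pid, sig) == 0)
        delivered = 1;
#endif
    return delivered ? 0 : -1;
}

/* Hypothetical helper: reap the child and report the signal that killed it,
 * 0 if it exited normally, or -1 if waitpid() failed. */
int child_exit_signal(pid_t child_pid)
{
    int status;
    while (waitpid(child_pid, &status, 0) == -1)
        if (errno != EINTR)
            return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Because the direct `kill()` is always sent, the child is signalled even if it has not yet called `setsid()`; the group kill is what would reach any grandchildren it has started.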

Hmm. It looks like something more than this may be needed. The postgres
recovery process appears to be ignoring it. I ran the whole database
in its own process group (ksh runs each job in its own process group
by default, so pg_ctl became the process group leader and everything
under pg_ctl stayed in that process group).

% ps -o pid,ppid,pgid,args -g 29141 | sort
PID PPID PGID COMMAND
29145 1 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
29146 29145 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
29147 29145 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
29501 29147 29141 sh -c /export/home/swharris/rr 000000010000000100000057 pg_xlog/RECOVERYXLOG
29502 29501 29141 /bin/ksh -p /export/home/swharris/rr 000000010000000100000057 pg_xlog/RECOVERYX
29537 29502 29141 sleep 5
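The `ps -g` listing above can also be checked from C with `getpgid()`, which reports the process group a given PID belongs to. This is a minimal sketch; `in_process_group` is a hypothetical helper, and PIDs such as 29141 are specific to that particular run.

```c
#define _XOPEN_SOURCE 700

#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if pid is a member of process group pgid,
 * 0 if it belongs to some other group, and -1 if getpgid() fails
 * (e.g. ESRCH because the process no longer exists). */
int in_process_group(pid_t pid, pid_t pgid)
{
    pid_t actual = getpgid(pid);
    if (actual == -1)
        return -1;
    return actual == pgid ? 1 : 0;
}
```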

I did
kill -QUIT -29141 ; sleep 1 ; touch /export/home/swharris/archives/STOP_SWEH_RECOVERY

This sent the QUIT signal to all of those processes. The shell script
ignores it and would simply retry, so the 'touch' creates the trigger
file that tells it to exit(1) rather than loop again.

Here is the log file (the timestamped entries come from my 'rr' program,
so I can see what it's doing):

To start with we see a normal recovery:

Wed Nov 22 13:41:20 EST 2006: Attempting to restore 000000010000000100000056
Wed Nov 22 13:41:25 EST 2006: Finished 000000010000000100000056
LOG: restored log file "000000010000000100000056" from archive
Wed Nov 22 13:41:25 EST 2006: Attempting to restore 000000010000000100000057
Wed Nov 22 13:41:25 EST 2006: Waiting for file to become available

Now I send the QUIT signal...

LOG: received immediate shutdown request

We can see that the sleep process got it!
/export/home/swharris/rr[37]: 29537 Quit(coredump)
And my script detects the trigger file:
Wed Nov 22 13:43:51 EST 2006: End of recovery trigger file found

Database recovery then appears to continue as normal; the postgres
recovery processes are still running despite having received SIGQUIT:

LOG: could not open file "pg_xlog/000000010000000100000057" (log file 1, segment 87): No such file or directory
LOG: redo done at 1/56000070
Wed Nov 22 13:43:51 EST 2006: Attempting to restore 000000010000000100000056
Wed Nov 22 13:43:55 EST 2006: Finished 000000010000000100000056
LOG: restored log file "000000010000000100000056" from archive
LOG: archive recovery complete
LOG: database system is ready
LOG: logger shutting down
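POSIX specifies that `system()` ignores SIGINT and SIGQUIT in the calling process (and blocks SIGCHLD) until the command finishes, while the child shell runs with the caller's original dispositions. If the recovery process runs the restore command through `system()` (an assumption here, not something this log shows), a SIGQUIT arriving while the command runs would be discarded by that process but would still kill the command, which would be consistent with the log above. The sketch runs a command via `system()` and reports whether the shell was killed by a signal; `restore_cmd_signal` is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper: run cmd through system() and classify the result.
 * While system() waits, this process ignores SIGINT and SIGQUIT, so a QUIT
 * aimed at it during that window is discarded; the child shell can still
 * be killed by it. Returns the signal that killed the shell, 0 on a normal
 * exit (whatever the exit code), or -1 if system() itself failed. */
int restore_cmd_signal(const char *cmd)
{
    int rc = system(cmd);
    if (rc == -1)
        return -1;
    if (WIFSIGNALED(rc))
        return WTERMSIG(rc);
    return 0;
}
```

POSIX `system()` does not ignore SIGTERM, so this would not, on its own, account for the SIGTERM case.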

pg_xlog now contains 000000010000000100000056 and 000000010000000100000057.

Something similar happens if I use SIGTERM rather than SIGQUIT.

I'm out of here in an hour, so to all you US-based people, have a good
Thanksgiving holiday!

--

rgds
Stephen
