Re: production server down

From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>,Michael Fuhr <mike(at)fuhr(dot)org>,"Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-19 02:51:56
Message-ID: 41C4ECCC.20300@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> I think Alvaro's idea that this copy of pg_control got created when the
> NFS mount was offline is a real good theory.  However, it would seem
> that that was quite some time ago (Nov 2 if not earlier), which would
> suggest that the mount instability problem has been around longer than
> Joe realizes :-(

I'm starting to wonder if this has somehow happened once or even twice 
before, each time the server was restarted. The timing of services 
starting on boot might have been biting us all along.

> If the bogus copy is indeed hiding underneath the mount point, then the
> sequence of events last week is easy to explain:
> 	* system boots
> 	* NFS mount takes awhile to come online
> 	* Postgres starts and reads the bogus pg_control into memory;
> 	  then it just sits there since they didn't try to start any
> 	  data loading tasks right away
> 	* eventually NFS mount comes online
> 	* next day, admin decides to shut down Postgres
> 	* Postgres changes last-mod date and state in its in-memory
> 	  pg_control, and writes it out, overwriting the "good" copy
> 	  on the NFS server
> 	* Postgres then panics because there's no WAL file where
> 	  pg_control indicates the shutdown checkpoint WAL record
> 	  should go
> 	* and now we're in the state Joe documented
> 
> So one thing I'd strongly suggest is stopping Postgres and dismounting
> the NFS server to see what's under there.  If there is a valid-looking
> PGDATA directory under there, you definitely want to get rid of it to
> reduce the risk of this happening again.
> 

Perhaps we should purposefully place a root-owned placeholder file
under the mount point -- that way Postgres would refuse to start at
all in this scenario.
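
A minimal sketch of the placeholder idea, with the check done by a
wrapper script rather than by Postgres itself. The marker name
.NFS_NOT_MOUNTED and the function name are hypothetical. The marker
would be created once, as root, while the NFS filesystem is unmounted,
so it is only visible when the mount is missing:

```shell
#!/bin/sh
# Hypothetical marker name; the thread only says "a root owned placeholder file".
SENTINEL=".NFS_NOT_MOUNTED"

# One-time setup, run as root with the NFS filesystem unmounted
# (shown as comments only, the path is a placeholder):
#   touch /path/to/mountpoint/.NFS_NOT_MOUNTED
#   chown root:root /path/to/mountpoint/.NFS_NOT_MOUNTED
#   chmod 0400 /path/to/mountpoint/.NFS_NOT_MOUNTED

# Returns 0 when the marker is hidden (something is mounted over the
# directory), 1 when the marker is visible (bare mount point).
mount_point_is_covered() {
    dir=$1
    if [ -e "$dir/$SENTINEL" ]; then
        echo "refusing to start: $dir/$SENTINEL is visible, $dir is not mounted" >&2
        return 1
    fi
    return 0
}
```

An init script could call mount_point_is_covered on the mount point
before doing anything else in its start branch, and bail out when it
returns non-zero.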

BTW, the init script is indeed the one which automatically does initdb:

[...]
case "$1" in
     start)
         touch $LOGFILE
         chown postgres:postgres $LOGFILE
         chmod 0600 $LOGFILE
         if [ ! -f $DATADIR/PG_VERSION ]; then
             echo -n "Initializing the PostgreSQL database at location ${DATADIR}"
             LANG_SYSCONFIG=/etc/sysconfig/language
             test -f "$LANG_SYSCONFIG" && . $LANG_SYSCONFIG
             LANG=${POSTGRES_LANG:-$RC_LANG}
             install -d -o postgres -g daemon -m 700 ${DATADIR} &&
             su - postgres -c "env -i LANG=$LANG initdb $DATADIR &> initlog" || rc_failed
[...]
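
For comparison, a sketch of how the start branch could avoid running
initdb just because PG_VERSION is missing. The ALLOW_INITDB variable
and the datadir_action function are hypothetical and not part of the
script quoted above. The idea is that a missing PG_VERSION leads to a
refusal unless the admin has explicitly opted in:

```shell
#!/bin/sh
# Decide what the "start" action should do for a given data directory.
# Prints one of: start, initdb, refuse.
# ALLOW_INITDB is a hypothetical opt-in switch, defaulting to "no".
datadir_action() {
    dir=$1
    if [ -f "$dir/PG_VERSION" ]; then
        # An initialized cluster is visible: normal startup.
        echo start
    elif [ "${ALLOW_INITDB:-no}" = "yes" ]; then
        # No cluster, and the admin has asked for one to be created.
        echo initdb
    else
        # No cluster and no opt-in: most likely the mount is missing.
        echo refuse
    fi
}
```

This only prevents a new bogus cluster from being created under the
mount point; it does nothing about one that is already there.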


Joe
