Re: production server down

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
Cc: Joe Conway <mail(at)joeconway(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>,"Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-19 00:01:22
Message-ID: 7174.1103414482@sss.pgh.pa.us
Lists: pgsql-hackers
Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl> writes:
> These values (from the corrupt pg_control file) are strange:

>> pg_control last modified:             Tue Dec 14 15:39:26 2004
>> Time of latest checkpoint:            Tue Nov  2 17:05:32 2004

The "last modified" date doesn't prove a lot because it would have
been updated when we set the "state" to "shutting down", just before
the panic occurred when we noticed there wasn't any WAL segment file
where pg_control said there should be one.  The "latest checkpoint"
is mighty interesting though.
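
A minimal sketch of that comparison, assuming the two lines quoted above are available as plain text (without the ">>" quoting) in the layout shown. The helper names parse_timestamp and control_fields are made up for illustration and are not part of any PostgreSQL tool.

```python
from datetime import datetime

# Month abbreviations as they appear in the quoted output (English names assumed).
MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def parse_timestamp(text):
    """Parse a 'Tue Dec 14 15:39:26 2004' style timestamp, independent of locale."""
    _weekday, month, day, clock, year = text.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(int(year), MONTHS[month], int(day), hour, minute, second)


def control_fields(report):
    """Split 'label: value' lines into a dict, cutting at the first colon only."""
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


SAMPLE = """\
pg_control last modified:             Tue Dec 14 15:39:26 2004
Time of latest checkpoint:            Tue Nov  2 17:05:32 2004
"""

fields = control_fields(SAMPLE)
gap = (parse_timestamp(fields["pg_control last modified"])
       - parse_timestamp(fields["Time of latest checkpoint"]))
print("days between latest checkpoint and last modification:", gap.days)
```

For the values quoted here the gap is roughly six weeks. A gap that large is not proof of anything by itself, but it is the kind of discrepancy worth flagging.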

I think Alvaro's idea that this copy of pg_control got created when the
NFS mount was offline is a real good theory.  However, it would seem
that that was quite some time ago (Nov 2 if not earlier), which would
suggest that the mount instability problem has been around longer than
Joe realizes :-(

If the bogus copy is indeed hiding underneath the mount point, then the
sequence of events last week is easy to explain:
	* system boots
	* NFS mount takes a while to come online
	* Postgres starts and reads the bogus pg_control into memory;
	  then it just sits there since they didn't try to start any
	  data loading tasks right away
	* eventually NFS mount comes online
	* next day, admin decides to shut down Postgres
	* Postgres changes last-mod date and state in its in-memory
	  pg_control, and writes it out, overwriting the "good" copy
	  on the NFS server
	* Postgres then panics because there's no WAL file where
	  pg_control indicates the shutdown checkpoint WAL record
	  should go
	* and now we're in the state Joe documented
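
One way to guard against the second and third steps of this sequence is to have the startup script refuse to launch Postgres until the mount is actually online. The following is only a sketch under stated assumptions: the function name safe_to_start and the paths in the usage lines are hypothetical, and os.path.ismount is the standard-library check for whether a directory is a mount point.

```python
import os


def safe_to_start(mount_point, pgdata, ismount=os.path.ismount):
    """Return True only if pgdata lives under mount_point and the mount is online.

    ismount can be overridden for testing; by default it is os.path.ismount.
    """
    mount_point = os.path.realpath(mount_point)
    pgdata = os.path.realpath(pgdata)
    # The check is meaningless if the data directory is not under the mount.
    if os.path.commonpath([mount_point, pgdata]) != mount_point:
        raise ValueError("pgdata is not located under mount_point")
    # If the mount is offline, pgdata resolves to whatever sits in the
    # underlying local directory, which may be a stale copy.
    return ismount(mount_point)


# Hypothetical paths, for illustration only.
if safe_to_start("/mnt/nfs", "/mnt/nfs/pgdata"):
    print("mount is online; OK to start Postgres")
else:
    print("mount is offline; refusing to start Postgres")
```

A check like this only narrows the window; it does not help if the mount drops after Postgres has already started.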

So one thing I'd strongly suggest is stopping Postgres and unmounting
the NFS filesystem to see what's under the mount point.  If there is a valid-looking
PGDATA directory under there, you definitely want to get rid of it to
reduce the risk of this happening again.
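
As a rough illustration of what "valid-looking" could mean, the sketch below checks a directory for two files every PGDATA directory contains: PG_VERSION at the top level and global/pg_control. The function name looks_like_pgdata and the path in the usage lines are hypothetical, and the check is meant to be run only after Postgres is stopped and the NFS filesystem is unmounted, so that the path refers to the underlying local directory.

```python
import os


def looks_like_pgdata(path):
    """Heuristic: does this directory contain the marker files of a PGDATA tree?"""
    has_version = os.path.isfile(os.path.join(path, "PG_VERSION"))
    has_control = os.path.isfile(os.path.join(path, "global", "pg_control"))
    return has_version and has_control


# Hypothetical path, for illustration only.
candidate = "/mnt/nfs/pgdata"
if looks_like_pgdata(candidate):
    print("found a PGDATA-like directory under the mount point:", candidate)
else:
    print("nothing resembling PGDATA at", candidate)
```

This is a heuristic only; a directory that passes it still needs to be inspected by hand before anything is deleted.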

			regards, tom lane
