On Wed, 11 Feb 2004, Mark Lubratt wrote:
> Interesting discussions on IDE drives and their write caches.
> I have a question...
> You mentioned that you'd see the problem during a large number of
> concurrent transactions. My question is, is this a necessary condition
> for the database crashing when the plug was pulled, or did you need to use
> a large number of concurrent transactions to "guarantee" that when you
> pulled the plug, that it would be at an inopportune time? In other
> words, is an IDE drive still "more" susceptible to a power outage
> problem even under light load?
Basically, if the data has been written to WAL, and an fsync issued, and
the drive has it in cache, but hasn't written it to the platters, and you
lose power, the database will likely be corrupted and will refuse to
start up when the machine boots up. Also, of course, some data will be
lost that was supposedly committed in a transaction.
So, yeah, the reason for having hundreds of open transactions is that it
makes the window of opportunity for a lying drive to corrupt the database
that much bigger.
So, yes, even under light load, you could have a corrupted database if you
lose power while a write is happening. Of course, if the database is
sitting idle at the time of the power outage then you're ok.
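
To make the sequence above concrete, here is a minimal sketch in Python (not PostgreSQL's actual C code) of "write the record, then fsync". The function name append_and_sync and the idea of a single log file are assumptions for illustration only. The point is that os.fsync() returning only means the OS was told by the drive that the data is safe; a drive with a lying write cache can acknowledge the flush while the bytes are still in its cache.

```python
import os


def append_and_sync(path, record):
    """Append one record to a log file and ask the OS to flush it to disk.

    Hypothetical helper for illustration; returns the number of bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        total = 0
        # os.write may write fewer bytes than asked, so loop until done.
        while total < len(record):
            total += os.write(fd, record[total:])
        # Blocks until the OS reports the data has reached the device.
        # If the drive acknowledges from its volatile write cache, this
        # returns "success" even though nothing is on the platters yet.
        os.fsync(fd)
    finally:
        os.close(fd)
    return total
```

Nothing in the program can tell an honest acknowledgement from a lying one, which is why the only real test is pulling the plug under load.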
Funny little story. We had an electrician working above our main power
switch (the big box that switches us from line power, to UPS, to the
diesel generator) and said electrician clipped a piece of wire that fell
into the switch, shorting it out, and taking down our entire hosting
center (think $1,000 a minute...)
As I was walking down a hallway, one of the winders / fox pro guys asked
me if my machine would come back up when the power came on (it runs on
dual 36 gig 10krpm SCSI drives under an LSI megaraid with battery backed
cache, and I've tested it by pulling the plug before going into production.) I'd
been bragging to him about the power plug pull tests it had passed, so of
course, he's just teasing me.
I told him that as long as the power cut hadn't spiked the box and fried
anything we were gold.
An hour later when they got the switch fixed and everything came back up,
my machine came up fine, but the NAS machines that provide the web storage
behind it (not the database, that's local) took about 10 minutes to fsck
or mount or whatever it is they do.
So I'm walking by foxpro guy's desk and I casually say "Well, looks like
my box had some problems coming back up." He smiles, thinking he's got
me, the bragging postgresql guy, by the short ones. "yeah, seems it boots
faster than the network storage it sits on. Just CTRL-ALT-DEL and it was
up and running fine." He laughed along with me. I trust PostgreSQL. On
SCSI or RAID with battery backed cache.