Re: Upgrading my BSDI box, again

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Cc: "Kurt J(dot) Lidl" <lidl(at)pix(dot)net>, "Steven M(dot) Schultz" <sms(at)TO(dot)GD-ES(dot)COM>
Subject: Re: Upgrading my BSDI box, again
Date: 2003-07-30 04:38:54
Message-ID: 200307300438.h6U4cs925630@candle.pha.pa.us
Lists: pgsql-hackers


[ CC to Kurt and Steven on bsdi list.]

Guys, I just replied to this email on the BSDI email list. The issue is
that someone found that some (most?) IDE drives have write cache enabled,
though the drives do not preserve the write cache data on power failure.

I am surprised we have not heard of this failure before because I know
most vendors who ship PostgreSQL test our crash recovery thoroughly.
Are they testing only using SCSI drives?
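
Why the drive matters to crash recovery can be shown with a minimal sketch
of the write-then-fsync sequence a database relies on before it reports a
commit. The helper log_commit() and the log file it writes are hypothetical
illustrations, not PostgreSQL's actual WAL code; only the POSIX calls
(open, write, fsync, close) are real.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, NOT PostgreSQL's WAL implementation: append one
 * commit record to a log file and return 0 only after fsync() succeeds.
 */
int
log_commit(const char *path, const char *record)
{
    size_t  len;
    ssize_t n;
    int     fd;

    assert(path != NULL && record != NULL);
    len = strlen(record);

    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    /* Step 1: hand the record to the kernel (system buffer cache). */
    n = write(fd, record, len);
    if (n < 0 || (size_t) n != len) {
        close(fd);
        return -1;
    }

    /*
     * Step 2: ask the kernel to push the record to the drive. If the
     * drive acknowledges writes that still sit in a volatile write
     * cache, fsync() can return success before the data is on the
     * platters, and a power failure at that point loses the record.
     */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }

    /* Step 3: only now may the caller report "committed". */
    return close(fd);
}
```

A caller would report the transaction as committed only when log_commit()
returns 0. Nothing in this sequence can detect a drive that acknowledges
the write early, which is why crash-recovery testing on drives with the
write cache disabled would not expose the problem.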

Below you will read that my Seagate SCSI drive has write cache disabled,
but another guy has a Seagate IDE drive that has it enabled, though it
loses data on power failure.

Scary!

Does anyone have any more detailed information on this?

---------------------------------------------------------------------------

Bruce Momjian wrote:
>
> I can tell you that if this was posted on a PostgreSQL mailing list, we
> would be freaking out!
>
> Having the data permanently on the disk is important for soft updates,
> but it is critical for databases. We go to great lengths to make sure
> the PostgreSQL write-ahead log is fsync'ed to the disk. When we report
> a transaction as committed, we expect to have that transaction reliably
> recorded even if you pull the plug on the computer right after the
> transaction completes. With the write cache enabled and not preserved
> on power failure, re-powering a system could leave a transaction
> partially completed, e.g. if you move $500 from one account to another,
> the money might show as removed from the original account but not appear
> in the new account. This, of course, is a disaster for a database that
> must be 100% reliable.
>
> I have heard of drives that either have battery backup for their cache,
> or have enough battery to guarantee the data gets written to the
> platters, or use the rotational energy of the spinning platters to
> write the data to the platters. I assumed that drive manufacturers
> were honest in enabling write cache by default only on drives that
> support a guaranteed safe cache.
>
> On my SCSI SEAGATE ST336607LW, I see:
>
> $ scsicmd -c msel -p all -f /dev/rsd0h
> ...
> disk caching page [pcode=0x08] (dcp):
>
> parameters saveable [sense] (ps): 0
> mode page code (mpcode): 0
> mode page length (mpl): 0
> write cache enable (wce): 0
>
> so it seems SCSI drives are being honest. It is a shame to learn IDE
> drives are not, and particularly when you can't control it using
> something like 'scsicmd'.
>
> I know that most of the vendors who ship PostgreSQL repeatedly test our
> crash recovery capabilities, and I have not heard of anyone reporting it
> didn't work, so I assume they are testing on SCSI or honest IDE drives.
>
> Seeing that Kurt is testing with a Seagate IDE drive, and Seagate is
> honest in shipping its SCSI drives with the write cache disabled by
> default, he might be right that IDE drives enable the cache by default
> because they are assumed to be for non-critical desktop machines.
> However, I know of many who use IDE drives for mission-critical servers,
> particularly in RAID configurations, so this is a serious concern.
>
> Because this was posted to the public BSDI list, I will CC it over to
> the PostgreSQL hackers list for comment, and keep Kurt in the CC.
>
> ---------------------------------------------------------------------------
>
> Kurt J. Lidl wrote:
> > On Tue, Jul 29, 2003 at 01:31:09PM -0700, Steven M. Schultz wrote:
> > > > From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
> > > >
> > > > The big question is whether that write cache is preserved if power is
> > > > removed from the drive. If it isn't, I don't think enabling it is a
> > > > good idea.
> > >
> > > Usually if a drive loses power that means the system has also
> > > lost power. The 4-8MB of data in the drive's cache would be lost
> > > along with the 100 to 200MB in the system's buffer cache.
> > >
> > > I've not had a problem with enabling the write cache in the drive but
> > > if it's a concern then using CTQ instead is equally effective and
> > > all one worries about is losing the system buffer cache on a power
> > > fail.
> >
> > Well, I cannot imagine that the difference between write-caches is
> > significant between IDE and SCSI drives in this context. On modern
> > IDE drives, the write cache is almost always on. With softupdates
> > enabled, any power loss (or even a hard reset of the machine) will
> > curdle the filesystem repeatably. Under 5.0 this is especially true,
> > in my experience. I cannot imagine running with the write cache
> > enabled on machines where I care about getting the data back off the
> > drive reliably... Let me rephrase that -- where I care about the
> > machine rebooting without requiring a manual fsck of the filesystem.
> >
> > This problem is so severe that Chris Ross and I worked on a patch
> > that allows you to turn off the write cache on a drive by drive
> > basis. Now my 5.0 machine(s) are much, much, much more likely to
> > survive a reset or sudden poweroff and come back up without manual
> > intervention.
> >
> > I'll echo Henry's statement -- the loss of the buffer cache is somewhat
> > problematic, but the drive's lying about what's on disk and what is really
> > just in the write cache is catastrophic when softupdates are turned on.
> >
> > I've attached my bug report and fix for the 5.0 code. The same change
> > probably would "just work" under 4.3 too. Too bad this didn't make it
> > into the stock 5.0 source tree. Oh well.
> >
> > -Kurt
> >
> > ----- snip, snip -----
> > Date: Tue, 10 Dec 2002 18:54:25 -0500
> > From: "Kurt J. Lidl" <lidl(at)pix(dot)net>
> > To: tsa-beta-problems(at)windriver(dot)com
> > Cc: cross(at)distal(dot)com, "Kurt J. Lidl" <lidl(at)pix(dot)net>
> > Subject: turning off write cache on IDE drives
> > Message-ID: <20021210185425(dot)A9454(at)pix(dot)net>
> >
> > I've noticed a really horrible interaction with the softupdate
> > code in the 5.0 beta and 5.0 final when the disk drive has
> > a write-cache enabled.
> >
> > I'd call this a RFE, but since I'm also sending the enhancement at
> > the bottom of the message, it's not really a "request for enhancement",
> > so much as a "here's the code for an enhancement". So, CFAE
> > follows...
> >
> > This isn't really a problem with softupdates, but rather a problem in
> > that BSD/OS doesn't have a way of turning off the write-cache on
> > IDE disk drives.
> >
> > My test machine has a Seagate ST340180A drive in it, which is a
> > pretty run of the mill, 40GB disk drive. Naturally, it comes
> > with both the read cache and write cache enabled, and there isn't
> > a physical jumper on the drive to turn either cache off.
> >
> > To demonstrate a catastrophic filesystem failure, boot a machine
> > with softupdates on /usr that also has a drive write cache enabled.
> > Run "find /usr" -- this will dirty up a bunch of inodes with modified
> > access times for the files. Power off the machine to simulate a
> > sudden loss of power in the field.
> >
> > Power the machine on. Notice the almost 100% occurrence of the error
> > "UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY". And notice
> > that part of your /usr is now gone. Repeat as often as you like;
> > eventually you'll run out of files in /usr to have unlinked :-)
> >
> > The solution to this rather horrific interaction between softupdates
> > and the write cache in the drive is to turn off the write cache.
> > This seems to solve the problem.
> >
> > Here's the patch that Chris Ross and I worked up to turn off the
> > write cache on a drive. With this patch, the user can set a line
> > like:
> >
> > -parm wd0 disable_wrcache=yes
> >
> > in their /etc/boot.default and have a much more robust system in
> > the face of sudden power loss (when using a modern IDE drive).
> >
> > It might be more clever to check for the write cache before attempting
> > to turn it off, but the following patch is tested and works.
> >
> > -Kurt
> >
> > --- wd.c 2002/12/10 22:18:47 1.1
> > +++ wd.c 2002/12/10 23:16:38 1.2
> > @@ -133,6 +133,7 @@
> > #define PARM4_WD_ISMAPPED 1
> > #define PARM4_WD_USEDMA 2
> > #define PARM4_WD_USELBA 3
> > +#define PARM4_WD_DISWRCACHE 4
> >
> > /*
> > * Drive states. Used for open and format operations.
> > @@ -509,6 +510,21 @@
> >  		(*prfunc)("%s%d*%d",
> >  		    sep, (wp->wdp_fixedcyl + wp->wdp_removcyl) *
> >  		    wp->wdp_heads * wp->wdp_sectors, WD_SECSIZE);
> > +
> > +	if ((mp = getparamfor(sc->wd_dk.dk_dev.dv_xname,
> > +	    PARM4_WD_DISWRCACHE)) && (*mp != 0)) {
> > +		(*prfunc)(": write cache disabled");
> > +		/* Set the "disable write cache" feature */
> > +		outb(sc->wd_iobase + wd_feature, 0x82);
> > +		/* Run the command, wait for it, and eat the interrupt */
> > +		outb(sc->wd_iobase + wd_command, WDCC_SETFEATURES);
> > +		if (wd_wait_nbusy(sc->wd_iobase)) {
> > +			printf("%s: Timeout disabling drive write cache\n",
> > +			    sc->wd_dk.dk_dev.dv_xname);
> > +		}
> > +		DELAY(1000);	/* Sometimes BUSY clears before interrupt */
> > +		inb(sc->wd_iobase + wd_status);
> > +	}
> >  out:
> >  	printf("\n");
> >  }
> > --- wd.4 2002/12/10 22:18:47 1.1
> > +++ wd.4 2002/12/10 23:32:34 1.2
> > @@ -85,6 +85,12 @@
> > .Do no Dc ,
> > Logical-Block Addressing is disabled, and Cylinder/Head/Sector addressing
> > is used instead.
> > +.It disable_wrcache
> > +If this parameter is specified as
> > +.Do yes Dc ,
> > +the drive will be sent a SET_FEATURE command which will attempt to turn
> > +off the write cache on the drive. With the write cache enabled, a
> > +sudden power loss can cause catastrophic filesystem failure.
> > .El
> > .Sh FILES
> > .Bl -tag -width /dev/rwd[0-7][a-h] -compact
> > --- /usr/src/etc/etc.i386/boot.define Tue Jun 26 17:13:01 2001
> > +++ /etc/boot.define Mon Dec 9 19:24:44 2002
> > @@ -294,6 +294,9 @@
> > wd use_lba 3
> > include generic yesno
> > ;
> > +wd disable_wrcache 4
> > + include generic yesno
> > +;
> > wdpi use_dma 2
> > include generic yesno
> > ;
> >
> > ----- snip, snip -----
> >
>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
> + If your life is a hard drive, | 13 Roberts Road
> + Christ can be your backup. | Newtown Square, Pennsylvania 19073

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
