IDE Drives and fsync

From: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: IDE Drives and fsync
Date: 2003-10-08 15:59:34
Message-ID: Pine.LNX.4.33.0310080945490.13727-100000@css120.ihs.com
Lists: pgsql-hackers

OK, I've done some more testing on our IDE drive machine.

First, some background. The hard drives we're using are Seagate
drives, model number ST380023A. Firmware version is 3.33. The machine
they are in is running RH9. The setup string I'm feeding them on startup
right now is: hdparm -c3 -f -W1 /dev/hdx

where:

-c3 sets I/O to 32 bit w/sync (uh huh, sure...)
-f sets the drive to flush buffer cache on exit
-W1 turns on write caching

The drives come up using DMA. Turning unmask IRQ on or off has no effect
on the tests I've been performing.

Without the -f switch, data corruption due to sudden power down is
almost certain. Running 'pgbench -c 5 -t 1000000' and pulling the plug
will result in recovery failing with the typical invalid page type
messages.

The pgbench database was originally initialized with -s 1.

If I turn off write caching (-W0), then the data is coherent no matter how
many concurrent clients I'm running, but performance is abysmal (it drops
from ~200 tps down to 45, or 10 if I'm using /dev/md0, a mirror set).
Otherwise, this is all on a single drive.

If I use -W1 and -f, then I get corruption on about every 4th test or so
if the number of parallel beaters is 50 or so. If I crank it up to 200, or
increase the size of the database by using -s 10 during initialization,
corruption becomes more likely. Note that EITHER a larger test database
OR a larger number of clients seems to increase the chance of corruption.
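
The combinations described above can be laid out as a test plan. This is a minimal dry-run sketch: it only prints the commands for each combination, since the test itself needs someone to pull the plug. The database name "pgbench" and the function name print_test_plan are assumptions for illustration, and carrying -c3 and -f over to the -W0 runs is also assumed.

```shell
#!/bin/sh
# Dry run only: prints the commands for each test combination, runs nothing.
# /dev/hdx comes from the setup string above; the database name is hypothetical.
print_test_plan() {
    dev=$1
    db=$2
    for wcache in 0 1; do              # -W0 = write cache off, -W1 = on
        for scale in 1 10; do          # pgbench -s value used at initialization
            for clients in 5 50 200; do
                echo "hdparm -c3 -f -W$wcache $dev"
                echo "pgbench -i -s $scale $db"
                echo "pgbench -c $clients -t 1000000 $db"
                echo "# pull the plug mid-run, reboot, note whether recovery succeeds"
            done
        done
    done
}

print_test_plan /dev/hdx pgbench
```

Repeating each combination several times matters here, because the corruption shows up only on some fraction of the runs.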

I'm guessing that with -W1 and -f, what's happening is that at higher
levels of parallel access, or with a larger data set, the time between when
the drive reports an fsync as done and when it actually writes the data out
is climbing, and it is more likely that data that is in transit to the WAL
is getting lost during the power plug pull.
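
One rough way to check whether a drive acknowledges synced writes before the data is really on the platter is to time a run of small fsync'd writes. The sketch below is only a probe under stated assumptions: the function name fsync_probe, the 8 kB block size, and the file path are illustrative, and it relies on dd supporting conv=fsync (GNU dd does). A disk spinning at 7200 RPM completes 120 rotations per second, so a rate far above that for rewriting the same block would suggest the write cache is answering instead of the platter.

```shell
#!/bin/sh
# Hypothetical probe: prints roughly how many fsync'd 8 kB writes complete
# per second. Timing uses whole seconds, so use a large count for real runs.
fsync_probe() {
    probe_file=$1
    count=$2
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        # conv=fsync makes dd call fsync() on the output file before exiting;
        # conv=notrunc rewrites the same first block instead of truncating.
        dd if=/dev/zero of="$probe_file" bs=8192 count=1 \
            conv=fsync,notrunc 2>/dev/null
        i=$((i + 1))
    done
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$elapsed" -eq 0 ]; then
        elapsed=1                      # avoid dividing by zero on short runs
    fi
    echo "$((count / elapsed))"
}
```

For example, running `fsync_probe /path/on/the/ide/drive/probe.dat 2000` once with -W0 and once with -W1 would show how far apart the two rates are. Process startup for each dd adds overhead, so the number is a lower bound on what the drive is acknowledging.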

Tom, you had mentioned adding a delay of some kind to the fsync logic, and
I'd be more than willing to try out any patch you'd like to toss out to me
to see if we can get a semi-stable behaviour out of IDE drives with the
-W1 and -f switches turned on. As it is, the performance is quite good,
and under low to medium loads, it seems to be capable of surviving the
power plug being pulled, so I'm wondering if we can come up with a slight
delay that might drop performance by some small percentage while
greatly decreasing the chance of data corruption.

Is this worth looking into? I can see plenty of uses for a machine that
runs on IDE for cost savings, while still providing a reasonable amount of
data security in case of power failure, but I'm not sure if we can get rid
of the problem completely or not.
