From: david(at)lang(dot)hm
To: Ron <rjpeace(at)earthlink(dot)net>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject: Re: SCSI vs SATA
Date: 2007-04-07 21:42:47
Lists: pgsql-performance
On Sat, 7 Apr 2007, Ron wrote:

> The reality is that all modern HDs are so good that it's actually quite rare 
> for someone to suffer a data loss event.  The consequences of such are so 
> severe that the event stands out more than just the statistics would imply. 
> For those using small numbers of HDs, HDs just work.
> OTOH, for those of us doing work that involves DBMSs and relatively large 
> numbers of HDs per system, both the math and the RW conditions of service 
> require us to pay more attention to quality details.
> Like many things, one can decide on one of multiple ways to "pay the piper".
> a= The choice made by many, for instance in the studies mentioned, is to 
> minimize initial acquisition cost and operating overhead and simply accept 
> having to replace HDs more often.
> b= For those in fields where this is not a reasonable option (financial 
> services, health care, etc), or for those literally using 100's of HD per 
> system (where statistical failure rates are so likely that TLC is required), 
> policies and procedures like those mentioned in this thread (paying close 
> attention to environment and use factors, sector remap detecting, rotating 
> HDs into and out of roles based on age, etc) are necessary.
> Anyone who does some close variation of "b" directly above =will= see the 
> benefits of using better HDs.
> At least in my supposedly unqualified anecdotal 25 years of professional 
> experience.

Ron, why is it that you assume that anyone who disagrees with you doesn't 
work in a place where they care about the datacenter environment, and 
isn't in a field like financial services? and why do you think that we 
are just trying to save a few pennies? (the costs do factor in, but it's 
not a matter of pennies, it's a matter of tens of thousands of dollars)

I actually work in the financial services field, and I do have a good 
datacenter environment that's well cared for.

while I don't personally maintain machines with hundreds of drives each, I 
do maintain hundreds of machines with a small number of drives in each, 
and a handful of machines with a few dozen drives. (the database 
machines are maintained by others, but I do see their failed drives)

it's also true that my experience is only over the last 10 years, so I've 
only been working with a few generations of drives, but my experience is 
different from yours.

my experience is that until the drives get to be 5+ years old the failure 
rate seems to be about the same for the 'cheap' drives as for the 'good' 
drives. I won't say that they are exactly the same, but they are close 
enough that I don't believe that there is a significant difference.

in other words, these studies do seem to match my experience.

this is why, when I recently had to create some large capacity arrays, I 
ended up with machines with only a few dozen drives in them instead of 
hundreds. I've got two machines with 6TB of disk, one with 8TB, one with 
10TB, and one with 20TB. I'm building these systems for ~$1K/TB for the 
disk arrays. other departments who choose $bigname 'enterprise' disk 
arrays are routinely paying 50x that price.

I am very sure that they are not getting 50x the reliability, and I'm sure 
that they aren't even getting 2x the reliability.

I believe that the biggest cause of data loss for people using the 
'cheap' drives is that one 'cheap' drive holds the capacity of 5 or so 
'expensive' drives, and since people don't realize this they don't 
realize that the time to rebuild the failed drive onto a hot-spare is 
correspondingly longer.

in the thread 'Sunfire X4500 recommendations' we recently had a discussion 
on this topic, started by a guy who was asking about the best way to 
configure the drives in his Sun X4500 (48 drive) system for safety. in that 
discussion I took some numbers from the CMU study and as a working figure 
I said a 10% chance for a drive to fail in a year (the study said 5-7% in 
most cases, but some third year drives were around 10%). writing 750G 
using ~10% of the system's capacity gives a rebuild time of about 5 days, 
and combining that with the failure rate it turns out that there is 
almost a 5% chance of a second drive failing in a 48 drive array in this 
time. If I were to build a single array with 142G 'enterprise' drives 
instead of with 750G 'cheap' drives the rebuild time would be only 1 day 
instead of 5, but you would have ~250 drives instead of 48 and so your 
chance of a problem would be the same (I acknowledge that it's unlikely 
that anyone would use 250 drives in a single array, and yes that does 
help, however if you had 5 arrays of 50 drives each you would still have 
a 1% chance of a second failure per array)
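
The point that smaller drives stop helping once you need more of them can 
be checked with a short Python sketch. It assumes rebuild time scales 
linearly with drive capacity and uses the rounded working figures from 
this message (about 0.001% failure chance per drive per hour, about 100 
hours to rebuild a 750G drive in the background). The function name and 
the simple "survivors x rate x hours" estimate are illustrative 
assumptions, not something taken from the CMU study.

```python
HOURLY_FAILURE = 0.00001      # ~0.001% per drive per hour (10% per year, rounded)
BASE_DRIVE_GB = 750
BASE_REBUILD_HOURS = 100      # background rebuild of one 750G drive

def second_failure_chance(total_gb, drive_gb):
    """Rough chance that another drive fails while one drive is rebuilding."""
    drives = round(total_gb / drive_gb)
    # assumption: rebuild time is proportional to the size of the drive
    rebuild_hours = BASE_REBUILD_HOURS * drive_gb / BASE_DRIVE_GB
    chance = (drives - 1) * HOURLY_FAILURE * rebuild_hours
    return drives, rebuild_hours, chance

TOTAL_GB = 48 * 750  # same raw capacity in both cases

for drive_gb in (750, 142):
    drives, hours, chance = second_failure_chance(TOTAL_GB, drive_gb)
    print(f"{drive_gb}G drives: {drives} drives, {hours:.0f}h rebuild, {chance:.1%}")
```

Under these assumptions both layouts land just under 5%, because the 
shorter rebuild is cancelled out by the larger number of drives.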

when I look at these numbers, my reaction isn't that it's wrong to go with 
the 'cheap' drives, my reaction is that single redundancy isn't good 
enough. depending on how valuable the data is, you need to either replicate 
the data to another system, or go with dual-parity redundancy (or both)

while drives probably won't be this bad in real life (this is, after all, 
slightly worse than the studies show for their 3rd year drives, and 
'enterprise' drives may be slightly better), I have to assume that they 
will be for my reliability planning.

also, if you read through the CMU study, drive failures were only a small 
percentage of system outages (16-25% depending on the site). you have to 
make sure that you aren't so fixated on drive reliability that you fail to 
account for other types of problems (down to and including the chance of 
someone accidentally powering down the rack that you are plugged into, be 
it by hitting a power switch or by overloading a weak circuit breaker)

In looking at these problems overall I find that in most cases I need to 
have redundant systems with the data replicated anyway (with logs sent 
elsewhere), so I can get away with building failover pairs instead of 
having each machine with redundant drives. I've found that I can 
frequently get a pair of machines for less money than other departments 
spend on buying a single 'enterprise' machine with the same specs 
(although the prices are dropping enough on the top-tier manufacturers 
that this is less true today than it was a couple of years ago), and I 
find that the failure rate is about the same on a per-machine basis, so I 
end up with a much better uptime record due to having the redundancy of 
the second full system (never mind things like it being easier to do 
upgrades, as I can work on the inactive machine and then failover to work 
on the other, now inactive, machine). while I could ask for the budget to 
be doubled to provide the same redundancy with the top-tier manufacturers, 
I don't do so for several reasons, the top two being that these 
manufacturers frequently won't configure a machine the way I want them to 
(just try to get a box with writeable media built in, either a floppy or a 
CDR/DVDR, they want you to use something external), and that doing so also 
exposes me to people second guessing me on where redundancy is needed 
('that's only development, we don't need redundancy there', until a system 
goes down for a day and the entire department is unable to work)

it's not that the people who disagree with you don't care about their 
data, it's that they have different experiences than you do (experiences 
that come close to matching the studies where they tracked hundreds of 
thousands of drives of different types), and as a result believe that the 
difference (if any) between the different types of drives isn't 
significant in the overall failure rate (especially when you take the 
difference of drive capacity into account)

David Lang

P.S. here is a chart from that thread showing the chances of losing data 
with different array configurations.

if you say that there is a 10% chance of a disk failing each year 
(significantly higher than the studies listed above, but close enough) 
then this works out to ~0.001% chance of a drive failing per hour (a 
reasonably round number to work with)

to write 750G at ~45MB/sec takes 5 hours of 100% system throughput, or ~50 
hours at 10% of the system throughput (background rebuilding)
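
The arithmetic in the two paragraphs above can be checked with a short 
Python sketch. The function names and the choice of decimal units (1G = 
10^9 bytes, 1MB = 10^6 bytes) are assumptions made for illustration, and 
spreading the yearly figure evenly across the hours of a year is the same 
simplification used above.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def hourly_failure_chance(annual_chance):
    """Spread a yearly failure chance evenly over every hour of the year."""
    return annual_chance / HOURS_PER_YEAR

def rebuild_hours(drive_bytes, bytes_per_sec, throughput_share):
    """Hours to rewrite a whole drive using only a share of the throughput."""
    seconds = drive_bytes / (bytes_per_sec * throughput_share)
    return seconds / 3600

per_hour = hourly_failure_chance(0.10)
full_speed = rebuild_hours(750e9, 45e6, 1.0)
background = rebuild_hours(750e9, 45e6, 0.10)

print(f"failure chance per hour: {per_hour:.4%}")
print(f"rebuild at 100% throughput: {full_speed:.1f} hours")
print(f"rebuild at 10% throughput: {background:.1f} hours")
```

The unrounded results come out slightly above 0.001% per hour and slightly 
below 5 and 50 hours, so the round numbers are close enough to work with.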

if we cut the rebuild rate in half to account for inefficiencies in 
retrieving data from other disks to calculate parity it can take 100 hours 
(just over four days) to do a background rebuild, which gives each 
remaining disk about a 0.1% chance of failing during that time. with 48 
drives this is a ~5% chance of losing everything with single-parity, 
however the odds of losing two more disks during this time are .25% so 
double-parity is _well_ worth it.
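
As a cross-check on the single- versus double-parity comparison, here is a 
minimal Python sketch. It assumes drives fail independently, each with the 
flat 0.1% chance over the ~100 hour rebuild, and computes the exact 
binomial chance of further failures among the 47 surviving drives. The 
helper name is hypothetical, and this is one possible model rather than 
the method used for the figures above.

```python
from math import comb

def chance_at_least(k, drives, p):
    """Chance that k or more of `drives` independent drives fail, each with chance p."""
    chance_fewer = sum(
        comb(drives, i) * p**i * (1 - p) ** (drives - i) for i in range(k)
    )
    return 1 - chance_fewer

P_PER_DRIVE = 0.001   # 0.1% chance per drive over a ~100 hour rebuild
SURVIVORS = 47        # 48-drive array with one drive already failed

single_parity_loss = chance_at_least(1, SURVIVORS, P_PER_DRIVE)
double_parity_loss = chance_at_least(2, SURVIVORS, P_PER_DRIVE)

print(f"one more failure (single-parity loses data): {single_parity_loss:.2%}")
print(f"two more failures (double-parity loses data): {double_parity_loss:.3%}")
```

Under this model the single-parity figure stays a little under 5%, and the 
double-parity figure comes out lower than the .25% obtained by squaring 
5%, so the rounded numbers err on the pessimistic side.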

chance of losing data before hotspare is finished rebuilding (assumes one 
hotspare per group, you may be able to share a hotspare between multiple 
groups to get slightly higher capacity)

> RAID 60 or Z2 -- Double-parity must lose 3 disks from the same group to lose data:
> disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
>             2          24           48           n/a                n/a
>             3          16           48           n/a         (0.0001% with manual replacement of drive)
>             4          12           48            12         0.0009%
>             6           8           48            24         0.003%
>             8           6           48            30         0.006%
>            12           4           48            36         0.02%
>            16           3           48            39         0.03%
>            24           2           48            42         0.06%
>            48           1           48            45         0.25%

> RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same group to lose data:
> disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
>             2          24           48            n/a        (~0.1% with manual replacement of drive)
>             3          16           48            16         0.2%
>             4          12           48            24         0.3%
>             6           8           48            32         0.5%
>             8           6           48            36         0.8%
>            12           4           48            40         1.3%
>            16           3           48            42         1.7%
>            24           2           48            44         2.5%
>            48           1           48            46         5%

so if I've done the math correctly the odds of losing data with the 
worst-case double-parity (one large array including hotspare) are about 
the same as the best-case single-parity (mirror + hotspare), but with 
almost triple the capacity.
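
The usable_disks columns in both tables, and the "almost triple" capacity 
comparison, can be reproduced with a short Python sketch. It assumes each 
group gives up one disk per parity level plus one hotspare, which matches 
the rows above. The function name is hypothetical, and the risk column is 
deliberately left out because its exact formula is not spelled out here.

```python
TOTAL_DISKS = 48
GROUP_SIZES = (3, 4, 6, 8, 12, 16, 24, 48)

def usable_disks(disks_per_group, parity_disks, hotspares_per_group=1):
    """Data disks left after removing parity and hotspare disks from each group."""
    groups = TOTAL_DISKS // disks_per_group
    data_per_group = disks_per_group - parity_disks - hotspares_per_group
    # a group too small to hold any data shows up as 0 (n/a in the tables)
    return groups * max(data_per_group, 0)

print("disks_per_group  single_parity  double_parity")
for size in GROUP_SIZES:
    print(f"{size:>15}  {usable_disks(size, 1):>13}  {usable_disks(size, 2):>13}")

best_single = usable_disks(3, 1)     # mirror + hotspare
worst_double = usable_disks(48, 2)   # one large group including hotspare
print(f"capacity ratio: {worst_double / best_single:.1f}x")
```

With 45 usable disks against 16, the ratio is about 2.8x.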
