
Re: Sunfire X4500 recommendations

From: david(at)lang(dot)hm
To: Matt Smiley <mss(at)rentrak(dot)com>
Cc: dimitrik(dot)fr(at)gmail(dot)com, pgsql-performance(at)postgresql(dot)org
Subject: Re: Sunfire X4500 recommendations
Date: 2007-03-28 05:34:38
Message-ID:
Lists: pgsql-performance
On Tue, 27 Mar 2007, Matt Smiley wrote:

> --------
> The goal is to calculate the probability of data loss when we lose a 
> certain number of disks within a short timespan (e.g. losing a 2nd disk 
> before replacing+rebuilding the 1st one).  For RAID 10, 50, and Z, we 
> will lose data if any disk group (i.e. mirror or parity-group) loses 2 
> disks.  For RAID 60 and Z2, we will lose data if 3 disks die in the 
> same parity group.  The parity groups can include arbitrarily many 
> disks.  Having larger groups gives us more usable diskspace but less 
> protection.  (Naturally we're more likely to lose 2 disks in a group of 
> 50 than in a group of 5.)
>    g = number of disks in each group (e.g. mirroring = 2; single-parity = 3 or more; dual-parity = 4 or more)
>    n = total number of disks
>    risk of losing any 1 disk = 1/n

please explain why you are saying that the risk of losing any 1 disk is 
1/n. shouldn't it be probability of failure * n instead?

>    risk of losing 1 disk from a particular group = g/n
>    risk of losing 2 disks in the same group = g/n * (g-1)/(n-1)
>    risk of losing 3 disks in the same group = g/n * (g-1)/(n-1) * (g-2)/(n-2)

following this logic the risk of losing all 48 disks in a single group of 
48 would be 100%
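a quick sketch of the conditional probability involved (my own 
illustration, not Matt's math: given one failure, where does the second 
land?  assumes independent, uniformly likely failures, which is 
optimistic):

```python
# given that one disk has already failed, the chance that a second
# uniformly-random failure lands in the same parity group.
# assumes independent failures (optimistic, per the google/cmu studies).

def second_failure_same_group(n, g):
    """n = total disks, g = disks per group; after the first failure,
    g-1 of the remaining n-1 disks share its group."""
    return (g - 1) / (n - 1)

# 48 disks as mirrored pairs vs. one 48-disk parity group:
print(second_failure_same_group(48, 2))   # -> 1/47, about 0.021
print(second_failure_same_group(48, 48))  # -> 1.0
```

note how the single 48-disk group guarantees that any second failure is 
in the same group.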

also what you are looking for is the probability of the second (and third) 
disks failing in time X (where X is the time necessary to notice the 
failure, get a replacement, and rebuild the disk)

the killer is the time needed to rebuild the disk; with multi-TB arrays 
it's sometimes faster to re-initialize the array and reload from backup 
than it is to do a live rebuild (the servers had a raid failure 
recently and HPA mentioned that it took a week to rebuild the array, but 
it would have only taken a couple of days to do a restore from backup)
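to put rough numbers on that window, a sketch (the 3% annual failure 
rate and the window lengths are made-up illustrative figures, and 
failures are assumed independent and exponential):

```python
import math

# chance that at least one surviving disk fails during the rebuild
# window.  AFR and window lengths are illustrative, not measurements.

def failure_during_rebuild(surviving_disks, afr, rebuild_hours):
    rate = -math.log(1 - afr) / (365 * 24)      # per-disk hazard per hour
    return 1 - math.exp(-rate * rebuild_hours) ** surviving_disks

# 47 surviving disks: a one-week rebuild vs. a 12-hour rebuild
print(failure_during_rebuild(47, 0.03, 24 * 7))  # roughly 2.7%
print(failure_during_rebuild(47, 0.03, 12))      # roughly 0.2%
```

an order of magnitude more exposure for the week-long rebuild, which is 
why the rebuild time matters so much.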

add to this the fact that disk failures do not appear to be truly 
independent of each other statistically (see the recent studies released 
by google and cmu), and I wouldn't bother with single-parity for a 
multi-TB array. If the data is easy to recreate (including from backup) or 
short-lived (say a database of log data that cycles every month or so) I 
would just do RAID-0 and plan on losing the data on drive failure (this 
assumes that you can afford the loss of service when this happens). if the 
data is more important then I'd do dual-parity or more, along with a hot 
spare so that the rebuild can start as soon as the first failure is 
noticed by the system, to give myself a fighting chance to save things.

> In terms of performance, I think RAID 10 should always be best for write 
> speed.  (Since it doesn't calculate parity, writing a new block doesn't 
> require reading the rest of the RAID stripe just to recalculate the 
> parity bits.)  I think it's also normally just as fast for reading, 
> since the controller can load-balance the pending read requests to both 
> sides of each mirror.

this depends on your write pattern. if you are doing sequential writes 
(say writing a log archive) then RAID 5 can be faster than RAID 10. since 
there is no existing data in the stripe the system doesn't have to read 
anything to calculate the parity, and with the data spread across more 
spindles you have a higher potential throughput.

if your write pattern is more random, and especially if you are 
overwriting existing data, then the reads needed to calculate the parity 
will slow you down.
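back-of-envelope illustration of that penalty (the 4-I/O read-modify-write 
cost for single-parity and the 150 IOPS/disk figure are the usual 
textbook assumptions, not X4500 measurements):

```python
# random small overwrites: single-parity RAID needs 4 I/Os per write
# (read old data, read old parity, write both back); RAID 10 needs 2
# (one write per mirror side).  150 IOPS/disk is illustrative.

def random_write_iops(disks, per_disk_iops, ios_per_write):
    return disks * per_disk_iops / ios_per_write

print(random_write_iops(48, 150, 4))  # parity RAID -> 1800.0
print(random_write_iops(48, 150, 2))  # RAID 10     -> 3600.0
```

so for random overwrites the mirrored layout gets roughly twice the 
write throughput out of the same spindles.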

as for read speed, it all depends on your access pattern and stripe size. 
if you are reading data that spans disks (larger than your stripe size) 
you end up with a single read tying up multiple spindles. with RAID 1 
(and variants) you can read from either disk of the set if you need 
different data within the same stripe that's on different disk tracks (if 
it's on the same track you'll get it just as fast reading from a single 
drive, or so close to it that it doesn't matter). beyond that the question 
is how many spindles you can keep busy reading (as opposed to seeking to 
new data or sitting idle because you don't need their data)

the worst case for reading is to be jumping through your data in strides 
of (stripe size * number of disks available, accounting for RAID type), as 
all your reads will end up hitting the same disk.
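a tiny sketch of that pathological stride (simple round-robin striping, 
ignoring parity rotation):

```python
# map a byte offset to a spindle under simple round-robin striping.
# reading in strides of stripe_size * data_disks keeps hammering the
# same spindle while the others sit idle.

def disk_for_offset(offset, stripe_size, data_disks):
    return (offset // stripe_size) % data_disks

stripe, disks = 64 * 1024, 8
stride = stripe * disks
spindles = {disk_for_offset(i * stride, stripe, disks) for i in range(100)}
print(spindles)  # -> {0}: every read lands on the same disk
```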

David Lang
