On Wed, 26 Dec 2007, Mark Mielke wrote:
> david(at)lang(dot)hm wrote:
>>> Thanks for the explanation David. It's good to know not only what but also
>>> why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be
>>> read: the one with the data and the parity disk?
>> no, because the parity is of the sort (A+B+C+P) mod X = 0
>> so if X=10 (which means in practice that only the last decimal digit of
>> anything matters, very convenient for examples)
>> A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0
>> if you read B and get 3 and P and get 4 you don't know if this is right or
>> not unless you also read A and C (at which point you would get
> I don't think this is correct. RAID 5 is parity which is XOR. The property of
> XOR is such that it doesn't matter what the other drives are. You can write
> any block given either: 1) The block you are overwriting and the parity, or
> 2) all other blocks except for the block we are writing and the parity. Now,
> it might be possible that option 2) is taken more than option 1) for some
> complicated reasons, but it is NOT to check consistency. The array is assumed
> consistent until proven otherwise.
I was being sloppy in explaining the reason. you are correct that for
writes you don't need to read all the data: you just need the current
parity block, the old data you are going to replace, and the new data to
be able to calculate the new parity block (and note that even with my
checksum example this would be the case).
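A minimal sketch of that read-modify-write parity update, assuming byte-wise XOR parity as Mark describes. The helper names (xor_blocks, update_parity, update_checksum) and the block contents are made up for illustration and do not come from any RAID implementation. The last function shows the same shortcut with the mod-10 checksum example.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    assert len({len(b) for b in blocks}) == 1, "blocks must be the same length"
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))


def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """New parity from only the old parity, the old data and the new data."""
    return xor_blocks(old_parity, old_data, new_data)


def update_checksum(old_p: int, old_value: int, new_value: int, x: int = 10) -> int:
    """Same idea for the (A+B+C+P) mod X = 0 checksum example."""
    return (old_p + old_value - new_value) % x


# Hypothetical three-data-disk stripe.
a, b, c = b"\x01\x01", b"\x02\x02", b"\x03\x03"
parity = xor_blocks(a, b, c)

# Overwrite block B without touching A or C.
new_b = b"\x07\x09"
parity_rmw = update_parity(parity, b, new_b)

# Recomputing from the whole stripe gives the same parity.
parity_full = xor_blocks(a, new_b, c)

# Checksum example: A=1, B=2, C=3, P=4, then B is rewritten to 7.
new_p = update_checksum(4, 2, 7)
```

Here parity_rmw and parity_full come out identical, and (1 + 7 + 3 + new_p) mod 10 is 0 again, so neither scheme needs the untouched blocks for a write.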
however, I was addressing the point that for reads you can't do any
checking until you have read in all the blocks.
if you never check the consistency, how will it ever be proven otherwise?
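To illustrate why a check on read needs the whole stripe, here is a rough sketch assuming XOR parity. The function names and block contents are invented for the example. With only B and P in hand, all you can compute is what A XOR C ought to be, and there is nothing to compare that against until A and C are read too.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))


def stripe_is_consistent(data_blocks, parity: bytes) -> bool:
    """True when the XOR of every data block plus the parity is all zeros."""
    return not any(xor_blocks(*data_blocks, parity))


# Hypothetical stripe with three data blocks.
a, b, c = b"\x01\x10", b"\x02\x20", b"\x03\x30"
parity = xor_blocks(a, b, c)

ok = stripe_is_consistent([a, b, c], parity)

# Flip one bit in B: the full-stripe check notices.
corrupt_b = b"\x02\x21"
detected = not stripe_is_consistent([a, corrupt_b, c], parity)

# Reading just B and P only yields the expected value of A XOR C.
expected_a_xor_c = xor_blocks(b, parity)
```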
>> in theory a system could get the same performance with a large sequential
>> read/write on raid5/6 as on a raid0 array of equivalent size (i.e. same
>> number of data disks, ignoring the parity disks) because the OS could read
>> the entire stripe in at once, do the calculation once, and use all the data
>> (or when writing, don't write anything until you are ready to write the
>> entire stripe, calculate the parity and write everything once).
> For the same number of drives, this cannot be possible. With 10 disks, on
> raid5, 9 disks hold data, and 1 holds parity. The theoretical maximum
> performance is only 9/10 of the 10/10 performance possible with RAID 0.
I was saying that a 10 drive raid0 could be the same performance as a 10+1
drive raid 5 or a 10+2 drive raid 6 array.
this is why I said 'same number of data disks, ignoring the parity disks'.
in practice you would probably not do quite this well anyway (you have the
parity calculation to make and the extra drive or two's worth of data
passing over your buses), but it could be a lot closer than any
implementation currently is.
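The two ways of counting can be put side by side with some back-of-the-envelope arithmetic. This is only an idealised model: the 70 MB/s per-disk figure is a made-up number, and the function name is hypothetical. It assumes full-stripe sequential I/O where throughput scales with the number of data disks and parity costs nothing.

```python
def ideal_sequential_mb_s(total_disks: int, parity_disks: int,
                          per_disk_mb_s: float) -> float:
    """Idealised full-stripe throughput: only the data disks count."""
    data_disks = total_disks - parity_disks
    return data_disks * per_disk_mb_s


PER_DISK_MB_S = 70.0  # hypothetical per-drive sequential rate

# Same number of data disks (my comparison): all three match.
raid0_10 = ideal_sequential_mb_s(10, 0, PER_DISK_MB_S)
raid5_10_plus_1 = ideal_sequential_mb_s(11, 1, PER_DISK_MB_S)
raid6_10_plus_2 = ideal_sequential_mb_s(12, 2, PER_DISK_MB_S)

# Same number of total drives (Mark's comparison): raid5 gets 9/10.
raid5_10_total = ideal_sequential_mb_s(10, 1, PER_DISK_MB_S)
ratio_same_drive_count = raid5_10_total / raid0_10

# Extra traffic over the buses for the 10+1 and 10+2 cases.
bus_overhead_raid5 = 11 / 10
bus_overhead_raid6 = 12 / 10
```

Both statements hold in this model; they just compare different things (equal data disks versus equal total drives).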
>> Unfortunately in practice filesystems don't support this, they don't do
>> enough readahead to want to keep the entire stripe (so after they read it
>> all in they throw some of it away), they (mostly) don't know where a stripe
>> starts (and so intermingle different types of data on one stripe and spread
>> data across multiple stripes unnecessarily), and they tend to do writes in
>> small, scattered chunks (rather than flushing an entire stripe's worth of
>> data at once)
> In my experience, this theoretical maximum is not attainable without
> significant write cache, and an intelligent controller, neither of which
> Linux software RAID seems to have by default. My situation was a bit worse in
> that I used applications that fsync() or journalled metadata that is ordered,
> which forces the Linux software RAID to flush far more than it should - but
> the same system works very well with RAID 1+0.
my statements above apply to any type of raid implementation, hardware or
software.
the thing that saves the hardware implementation is that the data is
written to a battery-backed cache and the controller lies to the system,
telling it that the write is complete, and then it does the write later.
on a journaling filesystem you could get very similar results if you put
the journal on a solid-state drive.
but for your application, the fact that you are doing lots of fsyncs is
what's killing you, because the fsync forces a lot of data to be written
out, swamping the caches involved, and requiring that you wait for seeks.
nothing other than a battery-backed disk cache of some sort would work
(either on the controller, or a solid-state drive on a journaled filesystem).