From: Mark Mielke <mark(at)mark(dot)mielke(dot)cc>
To: david(at)lang(dot)hm
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: SSD + RAID
Date: 2010-02-23 21:32:13
Lists: pgsql-performance
On 02/23/2010 04:22 PM, david(at)lang(dot)hm wrote:
> On Tue, 23 Feb 2010, Aidan Van Dyk wrote:
>> * david(at)lang(dot)hm <david(at)lang(dot)hm> [100223 15:05]:
>>> However, one thing that you do not get protection against with software
>>> raid is the potential for the writes to hit some drives but not others.
>>> If this happens the software raid cannot know what the correct contents
>>> of the raid stripe are, and so you could lose everything in that
>>> stripe
>>> (including contents of other files that are not being modified that
>>> happened to be in the wrong place on the array)
>> That's for stripe-based raid.  Mirror sets like raid-1 should give you
>> either the old data, or the new data, both acceptable responses since
>> the fsync/barrier hasn't "completed".
>> Or have I missed another subtle interaction?
> one problem is that when the system comes back up and attempts to 
> check the raid array, it is not going to know which drive has valid 
> data. I don't know exactly what it does in that situation, but this 
> type of error in other conditions causes the system to take the array 
> offline.

I think the real concern here is that depending on how the data is read 
later - and depending on which disks it reads from - it could read 
*either* old or new, at any time in the future. I.e. it reads "new" from 
disk 1 the first time, and then an hour later it reads "old" from disk 2.

I think this concern might be invalid for a properly running system, 
though. When a RAID array is not cleanly shut down, the RAID array 
should run in "degraded" mode until it can be sure that the data is 
consistent. In this case, it should pick one drive, and call it the 
"live" one, and then rebuild the other from the "live" one. Until it is 
re-built, it should only satisfy reads from the "live" one, or parts of 
the "rebuilding" one that are known to be clean.
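To make the read rule above concrete, here is a toy sketch of a two-disk mirror after an unclean shutdown. It is only an illustration of the policy I described, not how md is actually implemented; the class `Mirror` and its methods are names I made up for this example.

```python
class Mirror:
    """Toy model of a two-disk RAID-1 resync (hypothetical, not md's code)."""

    def __init__(self, disk_a, disk_b):
        self.disks = [list(disk_a), list(disk_b)]
        self.live = 0  # disk chosen as authoritative after the crash
        # Blocks of the rebuilding disk that are known to match the live disk.
        self.clean = [False] * len(disk_a)

    def read(self, block, prefer):
        # Use the preferred disk only if it is the live one,
        # or if this block has already been resynced.
        if prefer == self.live or self.clean[block]:
            return self.disks[prefer][block]
        return self.disks[self.live][block]

    def resync_step(self, block):
        # Copy one block from the live disk to the rebuilding disk.
        other = 1 - self.live
        self.disks[other][block] = self.disks[self.live][block]
        self.clean[block] = True


# Disk 0 got the write before the crash, disk 1 did not.
m = Mirror(["new", "x"], ["old", "x"])
first = m.read(0, prefer=0)
later = m.read(0, prefer=1)  # redirected to the live disk, never returns "old"
m.resync_step(0)
after = m.read(0, prefer=1)  # now served from the rebuilt disk
```

The point of the sketch is that a reader can never see "new" now and "old" an hour later, because the stale copy is not consulted until it has been overwritten.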

I use mdadm software RAID, and all of my reading (including some of its 
source code) and experience (shutting down the box uncleanly) tell me 
it is working properly. In fact, the "rebuild" process can get quite 
ANNOYING as the whole system becomes much slower during rebuild, and 
rebuild of large partitions can take hours to complete.

For mdadm, there is a not-so-well-known "write-intent bitmap" 
capability. Once enabled, mdadm will embed a small bitmap (128 bits?) 
into the partition, and each bit will indicate a section of the 
partition. Before writing to a section, it will mark that section as 
dirty using this bitmap. It will leave this bit set for some time after 
the section is "clean" (lazy clear). The effect is that at 
any point in time, only certain sections of the drive are dirty, and on 
recovery, it is a lot cheaper to only rebuild the dirty sections. It 
works really well.
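A minimal sketch of the bitmap idea, assuming one bit per fixed-size chunk of the device. The class name, the chunk size, and the method names are all hypothetical and chosen for illustration; the real on-disk format and clearing policy in md are more involved.

```python
class WriteIntentBitmap:
    """Toy write-intent bitmap: one dirty flag per chunk (hypothetical model)."""

    def __init__(self, device_blocks, chunk_blocks):
        self.chunk_blocks = chunk_blocks
        n_chunks = -(-device_blocks // chunk_blocks)  # ceiling division
        self.dirty = [False] * n_chunks

    def before_write(self, block):
        # Mark the chunk dirty BEFORE the data write is issued to the mirrors.
        self.dirty[block // self.chunk_blocks] = True

    def lazy_clear(self, chunk):
        # Called some time after all mirrors have the data for this chunk.
        self.dirty[chunk] = False

    def chunks_to_resync(self):
        # After a crash, only these chunks need to be copied between mirrors.
        return [i for i, d in enumerate(self.dirty) if d]


bm = WriteIntentBitmap(device_blocks=1000, chunk_blocks=100)
bm.before_write(5)
bm.before_write(250)
bm.before_write(260)             # same chunk as block 250
pending = bm.chunks_to_resync()  # 2 of 10 chunks instead of the whole device
bm.lazy_clear(0)
remaining = bm.chunks_to_resync()
```

On a real array the bitmap can be added to an existing md device with something like `mdadm --grow --bitmap=internal /dev/md0` (the device name here is just an example).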

So, I don't think this has to be a problem. There are solutions, and any 
solution that claims to be complete should offer these sorts of 

