Re: OT - 2 of 4 drives in a Raid10 array failed - Any chance of recovery?

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: PG-General Mailing List <pgsql-general(at)postgresql(dot)org>, ow(dot)mun(dot)heng(at)wdc(dot)com
Subject: Re: OT - 2 of 4 drives in a Raid10 array failed - Any chance of recovery?
Date: 2009-10-20 11:02:51
Message-ID: 4ADD98DB.702@postnewspapers.com.au
Lists: pgsql-general

On 20/10/2009 4:41 PM, Scott Marlowe wrote:

>> I have a 4 disk Raid10 array running on linux MD raid.
>> Sda / sdb / sdc / sdd
>>
>> One fine day, 2 of the drives just suddenly decide to die on me. (sda and
>> sdd)
>>
>> I've tried multiple methods to try to determine if I can get them back
>> online

You made an exact image of each drive onto new, spare drives with `dd'
or a similar disk imaging tool before trying ANYTHING, right?

Otherwise, you may well have made things worse, particularly since
you've tried to resync the array. Even if the data was recoverable
before, it might not be now.
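
If the drives haven't been imaged yet and will still respond at all, do that
before anything else. A minimal sketch, assuming /dev/sde is a blank spare at
least as large as the source (the device names here are examples only):

# best-effort raw image: keep going past read errors, pad bad blocks with zeroes
dd if=/dev/sda of=/dev/sde bs=64k conv=noerror,sync

The conv=noerror,sync flags make dd press on past unreadable sectors instead of
aborting at the first one. GNU ddrescue does an even better job, since it
retries bad regions and keeps a log of what it couldn't recover.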

How, exactly, have the drives failed? Are they totally dead, so that the
BIOS / disk controller don't even see them? Can the partition tables be
read? Does 'file -s /dev/sda' report any output? What's the output of:

smartctl -d ata -a /dev/sda

(and the same for sdd)?
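
It's also worth checking whether md can still read its superblock off each
disk. Something like this (the partition name is an example; use whatever your
array members actually are):

# dump the md superblock from one array member, if it's still readable
mdadm --examine /dev/sda1

If --examine works on all four members, the event counts and device roles it
reports will tell you how far out of date each disk is.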

If the problem is just a few bad sectors, you can usually just
force-re-add the drives into the array and then copy the array contents
to another drive either at a low level (with dd_rescue) or at a file
system level.
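
With Linux md, the force-re-add is usually a forced assemble, something along
these lines (array and member names are examples only, and only do this once
you have images of the disks):

# stop the broken array, then make md accept the stale members anyway
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

# then pull the array contents off at a low level
dd_rescue /dev/md0 /dev/sde

--force tells md to ignore mismatched event counts and bring the array up with
the freshest-looking set of members, which is exactly what you want here, and
exactly why you want images first in case it guesses wrong.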

If the problem is one or more totally fried drives, where the drive is
totally inaccessible or most of the data is hopelessly corrupt /
unreadable, then you're in a lot more trouble. RAID 10 effectively
stripes the data across the mirrored pairs, so if you lose a whole
mirrored pair you've lost half the stripes. It's not that different from
running paper through a shredder, discarding half the shreds, and then trying
to piece the rest back together.

On a side note: I'm personally increasingly annoyed by the tendency of RAID
controllers (and s/w RAID implementations) to treat disks with unrepairable
bad sectors as dead and fail them out of the array. That's OK if you have a
hot spare and no other drive fails during the rebuild, but it's just not good
enough when failing that drive would push the array into a failed state.
Rather than failing the drive and thereby rendering the whole array unreadable
in such situations, the implementation should mark the drive defective, set
the array read-only, and start screaming for help. Way too much data gets
murdered by RAID implementations removing mildly faulty drives from
already-degraded arrays instead of just going read-only.

--
Craig Ringer
