Re: Fwd: Re: SSDD reliability

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Fwd: Re: SSDD reliability
Date: 2011-05-05 21:31:38
Message-ID: 4DC3173A.4080600@2ndQuadrant.com
Lists: pgsql-general

On 05/04/2011 08:31 PM, David Boreham wrote:
> Here's my best theory at present : the failures ARE caused by cell
> wear-out, but the SSD firmware is buggy in so far as it fails to boot
> up and respond to host commands due to the wear-out state. So rather
> than the expected outcome (SSD responds but has read-only behavior),
> it appears to be (and is) dead. At least to my mind, this is a more
> plausible explanation for the reported failures vs. the alternative
> (SSD vendors are uniquely clueless at making basic electronics
> subassemblies), especially considering the difficulty in testing the
> firmware under all possible wear-out conditions.
>
> One question worth asking is : in the cases you were involved in, was
> manufacturer failure analysis performed (and if so what was the
> failure cause reported?).

Unfortunately not. Many of the people I deal with, particularly the
ones with budgets to be early SSD adopters, are not the sort to return
things that have failed to the vendor. In some of these shops, if the
data can't be securely erased first, it doesn't leave the place. The
idea that some trivial fix at the hardware level might bring the drive
back to life, data intact, is terrifying to many businesses when drives
fail hard.

Your bigger point, that this could just as easily be software failures
due to unexpected corner cases rather than hardware issues, is both a
fair one to raise and even scarier.

>> Intel claims their Annual Failure Rate (AFR) on their SSDs in IT
>> deployments (not OEM ones) is 0.6%. Typical measured AFR for
>> mechanical drives is around 2% during their first year, spiking to 5%
>> afterwards. I suspect that Intel's numbers are actually much better
>> than the other manufacturers here, so an SSD from anyone else can
>> easily be less reliable than a regular hard drive still.
>>
> Hmm, this is speculation I don't support (non-Intel vendors have a 10x
> worse early failure rate). The entire industry uses very similar
> processes (often the same factories). One rogue vendor with a bad
> process...sure, but all of them ??
>

I was postulating that you only have to be 4X as bad as Intel to reach
2.4%, and then be worse than a mechanical drive for early failures. If
you look at http://labs.google.com/papers/disk_failures.pdf you can see
there's a 5:1 ratio in first-year AFR just between light and heavy usage
on the drive. So a 4:1 ratio between best and worst manufacturer for
SSD seemed possible. Plenty of us have seen particular drive models
that were much more than 4X as bad as average ones among regular hard
drives.
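
To spell out that arithmetic, here is a rough sketch in Python. It uses
only the AFR figures quoted in this thread. The function names are made
up for illustration, and the 4X multiplier is a supposition about a
hypothetical worse vendor, not a measurement of any real one.

```python
# Figures quoted in this thread. The 4X multiplier below is a
# hypothetical "how bad would a vendor need to be" value, not data.
INTEL_SSD_AFR = 0.006       # Intel's claimed AFR for SSDs in IT deployments
HDD_FIRST_YEAR_AFR = 0.02   # typical measured first-year AFR, mechanical drives


def scaled_afr(base_afr, times_worse):
    """AFR of a vendor assumed to be `times_worse` times as bad as the baseline."""
    return base_afr * times_worse


def break_even_multiplier(base_afr, target_afr):
    """How many times worse than the baseline a vendor must be to match target_afr."""
    return target_afr / base_afr


worse_vendor_afr = scaled_afr(INTEL_SSD_AFR, 4)
break_even = break_even_multiplier(INTEL_SSD_AFR, HDD_FIRST_YEAR_AFR)

print(f"4X Intel's claimed AFR: {worse_vendor_afr:.1%}")
print(f"Worse than a first-year mechanical drive: {worse_vendor_afr > HDD_FIRST_YEAR_AFR}")
print(f"Break-even multiplier: {break_even:.2f}X")
```

Under these numbers, anything beyond roughly 3.3X Intel's claimed rate
already exceeds the 2% first-year figure for mechanical drives, which is
why 4X is enough to make the point.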

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
