Re: SSD + RAID

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Brad Nicholson <bnichols(at)ca(dot)afilias(dot)info>, Karl Denninger <karl(at)denninger(dot)net>, Laszlo Nagy <gandalf(at)shopzeus(dot)com>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: SSD + RAID
Date: 2009-11-19 04:35:02
Message-ID: C72A0AF6.175EA%scott@richrelevance.com
Lists: pgsql-performance


On 11/17/09 10:51 AM, "Greg Smith" <greg(at)2ndquadrant(dot)com> wrote:

> Merlin Moncure wrote:
>> I am right now talking to someone on postgresql irc who is measuring
>> 15k iops from x25-e and no data loss following power plug test.
> The funny thing about Murphy is that he doesn't visit when things are
> quiet. It's quite possible the window for data loss on the drive is
> very small. Maybe you only see it one out of 10 pulls with a very
> aggressive database-oriented write test. Whatever the odd conditions
> are, you can be sure you'll see them when there's a bad outage in actual
> production though.

Yes, but nothing is foolproof. Murphy visited me recently, and the
RAID card with BBU cache that the WAL logs were on crapped out. Data was
fine.

Had to fix up the system without any WAL logs. Luckily, out of 10TB, only
200GB or so of it could have been in the process of being written to (yay!
partitioning by date!), and we could restore just that part rather than
initiating a full restore.
Then there were fun times in single-user mode to fix corrupted system tables
(about half the system indexes were dead, and the statistics table was
corrupt, but that could be truncated safely).

It's all fine now, with all data validated.

Moral of the story: Nothing is 100% safe, so sometimes a small bit of KNOWN
risk is perfectly fine. There is always UNKNOWN risk. If you risk losing
256K of cached data on an SSD when you're really unlucky with timing, how
dangerous is that versus the chance that the RAID card or other hardware
barfs and takes out your whole WAL?

Nothing is safe enough to let you skip a full DR plan of action. The
individual tradeoffs are very application and data dependent.

>
> A good test program that is a bit better at introducing and detecting
> the write cache issue is described at
> http://brad.livejournal.com/2116715.html
>
> --
> Greg Smith 2ndQuadrant Baltimore, MD
> PostgreSQL Training, Services and Support
> greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com
>
>
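
For readers who want a feel for what a write cache test checks, here is a
rough local sketch in Python. It is not the program at the URL quoted above;
the function names (write_records, verify) and the record layout are
assumptions made up for illustration. The idea is to write small checksummed
records, fsync after each one, note which records were acknowledged, and
after a power pull check whether every acknowledged record is still on disk.

```python
import os
import struct
import zlib

# Hypothetical record layout: 8-byte sequence number + CRC32 of those 8 bytes.
RECORD = struct.Struct("<QI")


def write_records(path, count, on_ack):
    """Append `count` records, fsync after each, then report it as acknowledged."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for seq in range(count):
            payload = struct.pack("<Q", seq)
            os.write(fd, RECORD.pack(seq, zlib.crc32(payload)))
            os.fsync(fd)  # once this returns, the storage stack claims durability
            on_ack(seq)   # in a real test, record this on a DIFFERENT machine
    finally:
        os.close(fd)


def verify(path, last_acked):
    """Return acknowledged sequence numbers that are missing or corrupt on disk."""
    with open(path, "rb") as f:
        data = f.read()
    intact = set()
    for off in range(0, len(data) - RECORD.size + 1, RECORD.size):
        seq, crc = RECORD.unpack_from(data, off)
        if zlib.crc32(struct.pack("<Q", seq)) == crc:
            intact.add(seq)
    return [seq for seq in range(last_acked + 1) if seq not in intact]
```

To mean anything, the acknowledgements have to be kept somewhere that
survives the outage (another host, not the drive under test), and the plug
has to be pulled while write_records is still running. A non-empty result
from verify would suggest that something between fsync and the media
acknowledged writes it had not made durable. An empty result after a few
pulls proves little, which is the point made about Murphy above.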
