Re: SSD + RAID

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Pierre C <lists(at)peufeu(dot)com>
Cc: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Mark Mielke <mark(at)mark(dot)mielke(dot)cc>, Arjen van der Meijden <acmmailing(at)tweakers(dot)net>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: SSD + RAID
Date: 2010-02-23 18:36:56
Message-ID: B9BC5B98-5128-49F8-9CB9-11DD5AE983DD@richrelevance.com
Lists: pgsql-performance


On Feb 23, 2010, at 3:49 AM, Pierre C wrote:
> Now I wonder about something. SSDs use wear-leveling which means the
> information about which block was written where must be kept somewhere.
> Which means this information must be updated. I wonder how crash-safe and
> how atomic these updates are, in the face of a power loss. This is just
> like a filesystem. You've been talking only about data, but the block
> layout information (metadata) is subject to the same concerns. If the
> drive says it's written, not only the data must have been written, but
> also the information needed to locate that data...
>
> Therefore I think the yank-the-power-cord test should be done with random
> writes happening on an aged and mostly-full SSD... and afterwards, I'd be
> interested to know if not only the last txn really committed, but if some
> random parts of other stuff weren't "wear-leveled" into oblivion at the
> power loss...
>

A couple of years ago I postulated that SSDs could do random writes fast if they remapped blocks. Microsoft's SSD whitepaper at the time hinted at this too.
Persisting the remap data is not hard. It goes in the same location as the data, or a separate area that can be written to linearly.

Each block may contain its LBA and a transaction ID or other atomic counter. Or another block can hold that info. When the SSD powers up, it can build its LBA -> block table by reading that data, inverting it, and keeping the highest transaction ID when more than one block claims the same LBA.
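
To make the power-up step concrete, here is a minimal sketch of the rebuild in Python. It is a toy model, not how any particular drive's firmware works: the BlockTag record and the rebuild_map function are hypothetical names, and it assumes every physical block carries a readable (LBA, txID) tag.

```python
from typing import Dict, Iterable, NamedTuple


class BlockTag(NamedTuple):
    """Hypothetical per-block tag, as found while scanning flash at power-up."""
    physical: int  # physical block index where the data lives
    lba: int       # logical block address this data claims to be
    txid: int      # monotonically increasing transaction ID


def rebuild_map(tags: Iterable[BlockTag]) -> Dict[int, int]:
    """Invert physical -> (LBA, txID) into LBA -> physical.

    When several physical blocks claim the same LBA, the one with the
    highest txID wins; the others are stale copies awaiting erase.
    """
    newest: Dict[int, BlockTag] = {}
    for tag in tags:
        current = newest.get(tag.lba)
        if current is None or tag.txid > current.txid:
            newest[tag.lba] = tag
    return {lba: tag.physical for lba, tag in newest.items()}


# LBA 7 was written twice (txID 1, then txID 3); LBA 9 once.
tags = [BlockTag(0, 7, 1), BlockTag(1, 9, 2), BlockTag(2, 7, 3)]
table = rebuild_map(tags)
```

Here LBA 7 resolves to physical block 2, because txID 3 beats txID 1, and physical block 0 is simply ignored as a stale copy.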

Although SSDs have to ERASE data a large block at a time (256K to 2M typically), they can write linearly to an erased block in much smaller chunks.
Thus, to commit a write, either:
Data, LBA tag, and txID go in the same block (may require oddly sized blocks).
or
Data is written to one block (not committed yet), then the LBA tag and txID are written elsewhere (which commits the write). Since it's all copy-on-write, partial writes can't happen.
If a block is being moved or compressed when power fails, data should never be lost, since the old data still exists; the new version just didn't commit. But new data that is being written may not be committed yet in the case of a power failure unless other measures are taken.
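
The second option (data first, then a separate tag that commits it) can be sketched the same way. Again this is only an illustration under assumptions: ToyFlash and its methods are made-up names, the "flash" is a Python dict, and a power failure is modeled as simply never appending the commit record.

```python
class ToyFlash:
    """Toy copy-on-write store: data pages plus an append-only commit log."""

    def __init__(self):
        self.pages = {}       # physical page -> data, never overwritten in place
        self.commit_log = []  # linear log of (lba, physical page, txid)
        self.next_page = 0
        self.next_txid = 1

    def write_data(self, data):
        """Step 1: write the data to a fresh page. Not committed yet."""
        page = self.next_page
        self.next_page += 1
        self.pages[page] = data
        return page

    def commit(self, lba, page):
        """Step 2: append the LBA tag and txID. This is the commit point."""
        self.commit_log.append((lba, page, self.next_txid))
        self.next_txid += 1

    def read(self, lba):
        """Return data from the committed page with the highest txID."""
        best = None
        for entry_lba, page, txid in self.commit_log:
            if entry_lba == lba and (best is None or txid > best[1]):
                best = (page, txid)
        return None if best is None else self.pages[best[0]]


flash = ToyFlash()
flash.commit(5, flash.write_data(b"old"))

orphan = flash.write_data(b"new")  # power fails here: no commit record exists
after_crash = flash.read(5)        # the old version is still what LBA 5 maps to

flash.commit(5, orphan)            # had power stayed on, this commits the write
after_commit = flash.read(5)
```

The uncommitted page is just garbage to be erased later, which is why the old data survives. It is also why the new data is lost if power drops between the two steps.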
