Re: PostgreSQL block size for SSD RAID setup?

From: Scott Carey <scott(at)richrelevance(dot)com>
To: PFC <lists(at)peufeu(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: PostgreSQL block size for SSD RAID setup?
Date: 2009-02-25 19:23:07
Message-ID: C5CADA9B.2BBD%scott@richrelevance.com
Lists: pgsql-performance

Most benchmarks and reviews out there are very ignorant of SSD design. I suggest you start by reading some white papers and presentations from the research side that are public:
(pdf) http://research.microsoft.com/pubs/63596/USENIX-08-SSD.pdf
(html) http://www.usenix.org/events/usenix08/tech/full_papers/agrawal/agrawal_html/index.html
(pdf, slide-style presentation) http://institute.lanl.gov/hec-fsio/workshops/2008/presentations/day3/Prabhakaran-Panel-SSD.pdf

Benchmarks by EasyCo (a software layer that does what the hardware should when your SSD's controller stinks):
http://www.storagesearch.com/easyco-flashperformance-art.pdf

On 2/25/09 10:28 AM, "PFC" <lists(at)peufeu(dot)com> wrote:
> > Hi,
> > I was reading a benchmark that sets out block sizes against raw IO
> > performance for a number of different RAID configurations involving high
> > end SSDs (the Mtron 7535) on a powerful RAID controller (the Areca
> > 1680IX with 4GB RAM). See
> > http://jdevelopment.nl/hardware/one-dvd-per-second/

> Lucky guys ;)

> Something that bothers me about SSDs is the interface... The latest flash
> chips from Micron (32Gb = 4GB per chip) have something like 25 us "access
> time" (lol) and push data at 166 MB/s (yes, megabytes per second) per chip.
> So two of these chips are enough to bottleneck a SATA 3Gbps link... there
> would be 8 of those chips in a 32GB SSD. Parallelizing would depend on the
> block size: putting all chips in parallel would increase the block size,
> so in practice I don't know how it's implemented, probably depends on the
> make and model of SSD.

No, you would need at least 10 to 12 of those chips for such an SSD (one that does good wear leveling), since over-provisioning is required for wear leveling and for keeping the write amplification factor down.
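
To put some arithmetic behind that, here is a rough sketch in Python using the figures quoted above; the 20-50% spare-area range is my assumption for illustration, not any vendor's number:

    # Back-of-envelope sketch, using the per-chip figures quoted above.
    # The 20%-50% spare-area range is an assumption, not a vendor spec.
    chip_capacity_gb = 4.0      # 32Gbit Micron part = 4GB
    chip_read_mbps   = 166.0    # per-chip transfer rate
    sata_3g_mbps     = 300.0    # rough usable payload of a 3 Gbps SATA link

    print(sata_3g_mbps / chip_read_mbps)        # ~1.8 -> two chips saturate the link

    user_capacity_gb = 32.0
    print(user_capacity_gb / chip_capacity_gb)  # 8 chips just for the advertised capacity

    for spare in (0.20, 0.50):                  # assumed over-provisioning ratios
        raw_gb = user_capacity_gb * (1.0 + spare)
        print(spare, raw_gb / chip_capacity_gb) # ~9.6 to 12 chips once spare area is included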

> And then RAIDing those (to get back the lost throughput from using SATA)
> will again increase the block size which is bad for random writes. So it's
> a bit of a chicken and egg problem.

With cheap low-end SSDs that don't deal with random writes properly and can't remap LBAs to physical blocks in small chunks, and with RAID stripes smaller than erase blocks, yes. But for SSDs you want large RAID block sizes, no RAID 5, and no pre-loading of the whole block on a small read, since random access inside one block is fast, unlike on a hard drive.

> Also, since hard disks have high
> throughput but slow seeks, all the OSes and RAID cards, drivers, etc. are
> probably optimized for throughput, not IOPS. You need a very different
> strategy for 100K/s 8 kbyte IOs versus 1K/s 1 MByte IOs. Like huge queues,
> smarter hardware, etc.

Yes. I get better performance with software RAID 10, multiple plain SAS adapters, and SSDs than with any RAID card I've tried, because the RAID card can't keep up with the I/Os and tries to do a lot of scheduling work. Furthermore, a battery-backed caching RAID card is forced to prioritize writes at the expense of reads, which causes problems when you want to keep read latency low during a large batch write. Throw the same requests at a good SSD and it just works (though at the moment 90% of SSDs are still bad schedulers when reads and writes are concurrent).
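
To put rough numbers on the two I/O profiles quoted above (nothing here is a measurement, just the figures from that paragraph):

    # The two workloads from the paragraph above: similar aggregate bandwidth,
    # but a 100x difference in the number of requests the stack must handle.
    small_iops, small_size = 100000, 8 * 1024        # 100K/s of 8 kbyte IOs
    large_iops, large_size = 1000, 1024 * 1024       # 1K/s of 1 MByte IOs

    print(small_iops * small_size / 1e6)  # ~819 MB/s moved as tiny requests
    print(large_iops * large_size / 1e6)  # ~1049 MB/s moved as big requests

The throughput is in the same ballpark either way; it is the requests per second that queues, drivers, and RAID firmware actually have to keep up with.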

> FusionIO got an interesting product by using the PCI-e interface which
> brings lots of benefits like much higher throughput and the possibility of
> using custom drivers optimized for handling many more IO requests per
> second than what the OS and RAID cards, and even the SATA protocol, were
> designed for.
>
> Intrigued by this I looked at the FusionIO benchmarks: more than 100,000
> IOPS, really mind-boggling, but in random access over a 10MB file. A little
> bit of Google image search reveals the board contains a lot of Flash chips
> (expected) and a fat FPGA (expected), probably a high-end chip from X or A,
> and two DDR RAM chips from Samsung, probably acting as cache. So I wonder
> if the 10 MB file used as benchmark to reach those humongous IOPS was
> actually in the Flash?... or did they actually benchmark the device's
> onboard cache?...

Intel's SSD, and AFAIK FusionIO's device, do not cache writes in RAM (a tiny bit is buffered in SRAM on the Intel controller, 256K = the erase block size; unknown for FusionIO's FPGA).
That RAM is the working-space cache for the LBA -> physical block remapping. When a read request comes in, looking up which physical block contains that LBA would take a long time if it had to go through the flash (it's the block that claims to be mapped that way with the highest transaction number, or some other similar algorithm). The lookup table is therefore cached in RAM. The wear leveling and other tasks need working-set memory to operate as well.
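
Very roughly, the remap table kept in that RAM behaves like the toy sketch below. This is illustrative Python only, not Intel's or FusionIO's actual firmware logic; it just captures the "newest transaction wins" idea described above:

    # Toy FTL sketch: LBA -> physical page map kept in RAM, newest write wins.
    # Real controllers do this in firmware with per-block metadata, ECC, etc.
    class ToyFTL:
        def __init__(self):
            self.map = {}          # lba -> (physical_page, txn)
            self.txn = 0

        def write(self, lba, physical_page):
            self.txn += 1
            # Remap the LBA to the freshly written page; the old page
            # becomes garbage for wear leveling to reclaim later.
            self.map[lba] = (physical_page, self.txn)

        def read(self, lba):
            # O(1) lookup in RAM instead of scanning flash metadata for the
            # block with the highest transaction number.
            page, _txn = self.map[lba]
            return page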

> It probably has a writeback cache, so on a random writes benchmark this is
> an interesting question. A good RAID card with BBU cache would have the
> same benchmarking gotcha (i.e. if you go crazy on random writes on a 10 MB
> file, which is very small, and the device is smart, possibly at the end of
> the benchmark nothing at all was written to the disks!)

The numbers are slower for a 10GB file, but the drop is not as dramatic as you would expect. It's clearly not a write-back cache.

> Anyway, in a database use case, if random writes are going to be a pain
> they are probably not going to be distributed in a tiny 10MB zone which
> the controller cache would handle...
>
> (just rambling XDD)

Certain write load mixes can fragment the LBA -> physical block map and make wear leveling and write amplification reduction expensive, slowing things down. This effect is usually temporary and highly workload dependent.
The solution (see the white papers above) is more over-provisioning of flash. This can be achieved manually by making sure that more of the LBAs are NEVER written to: partition just 75% of the drive and leave the last 25% untouched, and there will be that much more spare area to work with, which makes even insanely crazy continuous random writes over the whole space perform at very high IOPS with low latency. This is only necessary for particular loads, and all flash devices over-provision to some extent. I'm pretty sure that the Intel X25-M, which exposes 80GB to the user, has at least 100GB of actual flash in there - perhaps 120GB. That over-provisioning may be internal to the actual flash chip, since Intel makes both the chip and the controller. There is definitely extra ECC and block metadata in there (this is not new; again, see the whitepaper).
The X25-E certainly is over-provisioned.
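
As a sketch of how the manual partitioning trick stacks with the built-in spare area (the 100GB raw figure is my guess from the paragraph above, not a datasheet number):

    # Spare-area arithmetic for an 80GB drive (raw capacity assumed, not official).
    raw_flash_gb = 100.0                 # assumed physical flash on the drive
    user_lba_gb  = 80.0                  # capacity exposed to the OS
    written_gb   = 0.75 * user_lba_gb    # partition (and ever write) only 75% of it

    spare_builtin = raw_flash_gb - user_lba_gb   # 20GB the controller always has
    spare_manual  = raw_flash_gb - written_gb    # 40GB once 25% of the LBAs stay untouched

    print(spare_builtin / raw_flash_gb)  # 0.20 -> 20% spare area
    print(spare_manual / raw_flash_gb)   # 0.40 -> 40% spare area for wear leveling to use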

In the future, there are two things that will help flash a lot:
*File systems that avoid writing to a region for as long as possible, preferring to write to areas previously freed at some point.
*New OS block device semantics. Currently it's just 'read' and 'write'. Once every LBA has been written to at least once, the device is always "100%" full. A 'deallocate' command would help SSD random writes, wear leveling, and write amplification algorithms significantly.
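
A sketch of why that matters: the work the controller does when reclaiming an erase block is proportional to the pages it still has to treat as live, and without a deallocate command that is every page ever written. The numbers below are illustrative assumptions only:

    # Pages the controller must copy forward when reclaiming one erase block.
    def pages_to_copy(pages_per_block, live_fraction):
        return int(pages_per_block * live_fraction)

    PAGES_PER_ERASE_BLOCK = 64   # e.g. a 256 KB erase block of 4 KB pages (assumption)

    # Without 'deallocate': every LBA ever written stays live forever.
    print(pages_to_copy(PAGES_PER_ERASE_BLOCK, 1.0))   # 64 pages copied per reclaim

    # With 'deallocate': the OS has told the drive that half the pages are free.
    print(pages_to_copy(PAGES_PER_ERASE_BLOCK, 0.5))   # 32 pages copied -> less write amplification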

