Re: Question: BlockSize > 8192 with FusionIO

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, "Strange, John W" <john(dot)w(dot)strange(at)jpmchase(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Question: BlockSize > 8192 with FusionIO
Date: 2011-01-05 06:41:20
Message-ID: 42AF139A-0385-4226-B81C-9569FB64873E@richrelevance.com
Lists: pgsql-performance


On Jan 4, 2011, at 8:48 AM, Merlin Moncure wrote:

> On Mon, Jan 3, 2011 at 9:13 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>> Strange, John W wrote:
>>>
>>> Has anyone had a chance to recompile and try a larger blocksize
>>> than 8192 with pSQL 8.4.x?
>>
>> While I haven't done the actual experiment you're asking about, the problem
>> working against you here is how WAL data is used to protect against partial
>> database writes. See the documentation for full_page_writes at
>> http://www.postgresql.org/docs/current/static/runtime-config-wal.html
>> Because full size copies of the blocks have to get written there, attempts
>> to chunk writes into larger pieces end up requiring a correspondingly larger
>> volume of writes to protect against partial writes to those pages. You
>> might get a nice efficiency gain on the read side, but the situation when
>> under a heavy write load (the main thing you have to be careful about with
>> these SSDs) is much less clear.
>
> most flash drives, especially mlc flash, use huge blocks anyways on
> physical level. the numbers claimed here
> (http://www.fusionio.com/products/iodrive/) (141k write iops) are
> simply not believable without write buffering. i didn't see any note
> of how fault tolerance is maintained through the buffer (anyone
> know?).

Flash may have very large erase blocks (4k to 16M), but you can write to it sequentially in much smaller units.

It has to erase a block in bulk, but it can write to an erased block piece by piece, sequentially (typically 512 or 4096 bytes at a time, though some flash uses 8k or 16k).

Older MLC NAND flash could be written a couple of bytes at a time -- but drives today incorporate too much ECC, and use chunks too large, to do that. The minimum write size is now set by the ECC requirements, not by the physical NAND flash requirements.

So buffering isn't that big of a requirement with the current LBA-to-physical translation layers, which turn all writes -- random or not -- into sequential writes within one erase block.
But performance will not be all that good if every write has to wait for completion, especially with MLC. Turn off the buffer on an Intel SLC drive, for example, and write IOPS are cut by 1/3 or more -- to 'only' 1000 or so IOPS.
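
To make the LBA-to-physical translation point concrete, here is a minimal Python sketch of the idea. It is a toy model, not any vendor's actual design: the class name ToyFTL, the 4-page erase block, and the append-only placement policy are all assumptions chosen for illustration, and garbage collection, wear leveling, and ECC are left out entirely.

```python
# Toy flash translation layer (FTL). All names and sizes here are
# hypothetical; real drives are far more complex.
PAGES_PER_ERASE_BLOCK = 4  # assumed tiny value so the output is easy to read


class ToyFTL:
    def __init__(self, num_erase_blocks):
        self.num_pages = num_erase_blocks * PAGES_PER_ERASE_BLOCK
        self.flash = [None] * self.num_pages  # physical pages, all "erased"
        self.lba_to_phys = {}                 # logical block -> physical page
        self.next_free = 0                    # sequential append point

    def write(self, lba, data):
        """Place the write at the next erased page, wherever the LBA points."""
        if self.next_free >= self.num_pages:
            # A real FTL would reclaim stale pages by erasing whole blocks.
            raise RuntimeError("no erased pages left in this toy model")
        phys = self.next_free
        self.flash[phys] = data
        self.lba_to_phys[lba] = phys  # any older copy of this LBA is now stale
        self.next_free += 1
        return phys

    def read(self, lba):
        return self.flash[self.lba_to_phys[lba]]


if __name__ == "__main__":
    ftl = ToyFTL(num_erase_blocks=2)
    # Scattered ("random") logical addresses, including an overwrite of LBA 3.
    for lba, data in [(907, "a"), (3, "b"), (512, "c"), (3, "d")]:
        phys = ftl.write(lba, data)
        print("LBA", lba, "-> physical page", phys,
              "in erase block", phys // PAGES_PER_ERASE_BLOCK)
```

The logical addresses are scattered, but the physical pages are handed out strictly in order, so the flash only ever sees sequential programming of erased pages. The overwrite of LBA 3 does not touch the old page; it just remaps the LBA and leaves a stale page behind for later bulk erase.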
