
Re: SSD + RAID

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, Laszlo Nagy <gandalf(at)shopzeus(dot)com>, Ivan Voras <ivoras(at)freebsd(dot)org>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: SSD + RAID
Date: 2009-11-19 21:04:29
Message-ID: 4B05B2DD.2050201@2ndquadrant.com
Lists: pgsql-performance
Scott Carey wrote:
> Have PG wait a half second (configurable) after the checkpoint fsync()
> completes before deleting/ overwriting any WAL segments.  This would be a
> trivial "feature" to add to a postgres release, I think.  Actually, it
> already exists!  Turn on log archiving, and have the script that it runs after a checkpoint sleep().
>   
That won't help.  Once the checkpoint is done, the problem isn't just 
that the WAL segments are recycled.  The server isn't going to use them 
even if they were there.  The reason why you can erase/recycle them is 
that you're doing so *after* writing out a checkpoint record that says 
you don't have to ever look at them again.  What you'd actually have to 
do is hack the server code to insert that delay after every fsync--there 
are none that you can cheat on and not introduce a corruption 
possibility.  The whole WAL/recovery mechanism in PostgreSQL doesn't 
make a lot of assumptions about what the underlying disk has to actually 
do beyond the fsync requirement; the flip side to that robustness is 
that it's the one you can't ever violate safely.
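To make the fsync requirement concrete, here is a minimal sketch of the write-then-fsync pattern in Python.  The function name durable_append and the file layout are hypothetical, chosen only for illustration; this is not PostgreSQL's actual WAL code.  The point is only that the caller may treat the data as durable once fsync() has returned, and not a moment before.

```python
import os

def durable_append(path: str, payload: bytes) -> int:
    """Append payload to path and return only after fsync() has succeeded.

    Hypothetical helper for illustration; it is not taken from PostgreSQL.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        written = 0
        # os.write() may write fewer bytes than requested, so loop.
        while written < len(payload):
            written += os.write(fd, payload[written:])
        # Ask the OS to push the data to stable storage.  If the drive
        # acknowledges this while the data is still only in a volatile
        # cache, every guarantee built on top of this call is gone.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written
```

Sleeping after this call changes nothing about what the drive has actually persisted, which is why a delay only helps if it follows every single fsync.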
> BTW, the information I have seen indicates that the write cache is 256K on
> the Intel drives, the 32MB/64MB of other RAM is working memory for the drive
> block mapping / wear leveling algorithms (tracking 160GB of 4k blocks takes
> space).
>   
Right.  It's not used like the write-cache on a regular hard drive, 
where they're buffering 8MB-32MB worth of writes just to keep seek 
overhead down.  It's there primarily to allow combining writes into 
large chunks, to better match the block size of the underlying SSD flash 
cells (128K).  Having enough space for two full cells allows spooling 
out the flash write to a whole block while continuing to buffer the next 
one.

This is why turning the cache off can tank performance so badly--you're 
going to be writing a whole 128K block no matter what if it's forced to 
disk without caching, even if it's just to write an 8K page to it.  
That's only going to reach 1/16 of the usual write speed on single page 
writes.  And that's why you should also be concerned about whether 
disabling the write cache impacts the drive longevity: lots of small 
writes going out in small chunks are going to wear flash out much faster 
than if the drive is allowed to wait until it's got a full sized block 
to write every time.
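The 1/16 figure is just the ratio of the two sizes mentioned above.  A small sketch of the arithmetic, assuming (as a simplification) that every uncached write costs whole 128K flash blocks; the function names are made up for this example:

```python
FLASH_BLOCK_BYTES = 128 * 1024  # flash block size quoted in this message
PAGE_BYTES = 8 * 1024           # one 8K database page

def uncached_efficiency(write_bytes: int,
                        block_bytes: int = FLASH_BLOCK_BYTES) -> float:
    """Fraction of the flash written that carries new data, assuming each
    write is forced out alone and always costs whole flash blocks."""
    blocks = -(-write_bytes // block_bytes)  # ceiling division
    return write_bytes / (blocks * block_bytes)

def write_amplification(write_bytes: int,
                        block_bytes: int = FLASH_BLOCK_BYTES) -> float:
    """Bytes of flash written per byte of data the host asked to write."""
    return 1.0 / uncached_efficiency(write_bytes, block_bytes)

print(uncached_efficiency(PAGE_BYTES))   # 0.0625, i.e. 1/16
print(write_amplification(PAGE_BYTES))   # 16.0
```

The same factor of 16 applies to wear under this simplified model, since each 8K page written consumes a full 128K block write.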

The fact that the cache is so small is also why it's harder to catch the 
drive doing the wrong thing here.  The plug test is pretty sensitive to 
a problem when you've got megabytes worth of cached writes that are 
spooling to disk at spinning hard drive speeds.  The window for loss on 
an SSD with no seek overhead and only a moderate number of KB worth of 
cached data is much, much smaller.  Doesn't mean it's gone though.  It's 
a shame that the design wasn't improved just a little bit; a cheap 
capacitor and blocking new writes once the incoming power dropped is all 
it would take to make these much more reliable for database use.  But 
that would raise the price, and not really help anybody but the small 
subset of the market that cares about durable writes.
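As a rough way to picture the difference in exposure, the window can be modeled as cached bytes divided by the rate at which the cache drains.  The cache sizes below come from this thread (32MB for a hard drive, 256K for the Intel SSD); the drain rates are purely hypothetical placeholders, not measurements of any drive.

```python
KIB = 1024
MIB = 1024 * KIB

def loss_window_seconds(cached_bytes: int,
                        drain_bytes_per_second: float) -> float:
    """Time for a full write cache to drain, which is how long
    acknowledged-but-unwritten data could be exposed to a power cut."""
    if drain_bytes_per_second <= 0:
        raise ValueError("drain rate must be positive")
    return cached_bytes / drain_bytes_per_second

# ASSUMED drain rates, for illustration only.
hdd_window = loss_window_seconds(32 * MIB, 1 * MIB)     # seek-bound disk
ssd_window = loss_window_seconds(256 * KIB, 64 * MIB)   # no seek overhead

print(hdd_window)  # 32.0
print(ssd_window)  # 0.00390625
```

Whatever the real rates are, the shape of the result is the same: a much smaller window that is harder to hit with a plug test, but one that is still not zero.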
> 4: Yet another solution:  The drives DO adhere to write barriers properly.
> A filesystem that used these in the process of fsync() would be fine too.
> So XFS without LVM or MD (or the newer versions of those that don't ignore
> barriers) would work too.
>   
If I really trusted anything beyond the very basics of the filesystem to 
really work well on Linux, this whole issue would be moot for most of 
the production deployments I do.  Ideally, fsync would just push out the 
minimum of what's needed and call the appropriate write cache 
flush mechanism the way the barrier implementation does when that all 
works, and life would be good.  Alternately, you might even switch to using 
O_SYNC writes instead, which on a good filesystem implementation are 
both accelerated and safe compared to write/fsync (I've seen that work 
as expected on Veritas VxFS for example). 
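For comparison with the write/fsync pattern, a minimal sketch of the O_SYNC alternative in Python.  The helper name osync_append is hypothetical, and os.O_SYNC is only available on Unix-like platforms.  With this flag each write() is supposed to return only once the data is on stable storage, so no separate fsync() call is made.

```python
import os

def osync_append(path: str, payload: bytes) -> int:
    """Append payload using O_SYNC so every write() is synchronous.

    Hypothetical helper for illustration.  Whether the data really
    reaches stable storage still depends on the filesystem and on the
    drive honoring cache flushes.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        written = 0
        # os.write() may write fewer bytes than requested, so loop.
        while written < len(payload):
            written += os.write(fd, payload[written:])
    finally:
        os.close(fd)
    return written
```

PostgreSQL exposes a similar choice through its wal_sync_method setting, which includes open_sync alongside the fsync-based methods.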

Meanwhile, in the actual world we live in, patches that make writes more 
durable by default are dropped by the Linux community because they tank 
performance for too many types of loads; I'm frightened to turn on 
O_SYNC at all on ext3 because of reports of corruption on the lists 
here; fsync does way more work than it needs to; and the way the 
filesystem and block drivers have been separated makes it difficult to 
do any sort of device write cache control from userland.  This is why I 
try to use the simplest, best tested approach out there whenever possible.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com  www.2ndQuadrant.com

