Re: ext4 finally doing the right thing

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: pgsql-performance(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-21 05:58:13
Message-ID: 4B57ECF5.7050502@2ndquadrant.com
Lists: pgsql-performance
Greg Stark wrote:
>
> That doesn't sound right. The kernel having 10% of memory dirty 
> doesn't mean there's a queue you have to jump at all. You don't get 
> into any queue until the kernel initiates write-out which will be 
> based on the usage counters -- basically an LRU. fsync and cousins like 
> sync_file_range and posix_fadvise(DONT_NEED) initiate write-out 
> right away.
>

Most of the safe ways ext3 knows to initiate a write-out on something that 
must go (because it's gotten an fsync on data there) require flushing 
every outstanding write to that filesystem along with it.  So as soon as 
a single WAL write shows up, bam!  The whole cache is emptied (or at 
least everything associated with that filesystem), and the caller who 
asked for that little write is stuck waiting for everything to clear 
before their fsync returns success.
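
One rough way to observe this stall is to time a small append-plus-fsync 
while something else is dirtying a lot of data on the same filesystem.  
The following is only a minimal Python sketch, not anything PostgreSQL 
itself does; the function name timed_fsync_write and the probe file name 
are made up for illustration.

```python
import os
import tempfile
import time


def timed_fsync_write(path, payload):
    """Append payload to path, fsync it, and return seconds spent in fsync.

    Illustrative helper (hypothetical name): it stands in for a small
    WAL-style write that must reach disk before the caller continues.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(payload)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        start = time.perf_counter()
        os.fsync(fd)  # blocks until the kernel reports the data as durable
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Assumption: for a meaningful measurement, place the probe file on the
    # filesystem under test instead of a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        probe = os.path.join(tmp, "wal_probe.bin")
        for i in range(5):
            elapsed = timed_fsync_write(probe, b"x" * 8192)
            print(f"fsync {i}: {elapsed * 1000:.2f} ms")
```

Run on an idle filesystem, the reported times only give a baseline.  The 
behavior described above would show up as a much larger number when the 
same probe runs while a large backlog of dirty data is pending on that 
filesystem.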

This particular issue absolutely killed Firefox when they switched to 
using SQLite not too long ago; high-level discussion at 
http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/ and 
confirmation/discussion of the issue on lkml at 
https://kerneltrap.org/mailarchive/linux-fsdevel/2008/5/26/1941354 . 

Note the comment from the first article saying "those delays can be 30 
seconds or more".  On multiple occasions, I've measured systems with 
dozens of disks in a high-performance RAID1+0 with a battery-backed 
controller that could grind to a halt for 10, 20, or more seconds in 
this situation, when running pgbench on a big database.  As was the case 
on the latest one I saw, if you've got 32GB of RAM and have let 3.2GB of 
random I/O from background writer/checkpoint writes back up because 
Linux has been lazy about getting to them, that takes a while to clear 
no matter how good the underlying hardware is.
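
As a back-of-envelope check on those numbers, the sketch below estimates 
how long a backlog of that size takes to drain.  The 32GB and 10% figures 
come from this thread; the 8192-byte write size matches PostgreSQL's 
default block size; the random-write rate of 20,000 per second is purely 
an assumed figure for a large battery-backed array, not a measurement.

```python
def drain_seconds(dirty_bytes, write_size_bytes, writes_per_second):
    """Estimate seconds needed to flush dirty_bytes as fixed-size random writes."""
    writes = dirty_bytes / write_size_bytes
    return writes / writes_per_second


ram_bytes = 32 * 10**9                          # 32GB of RAM, as in the example above
dirty_percent = 10                              # 10% of memory dirty
dirty_bytes = ram_bytes * dirty_percent // 100  # 3.2GB backlog

block_size = 8192            # PostgreSQL default block size in bytes
assumed_write_rate = 20_000  # ASSUMPTION: sustained random writes per second

estimate = drain_seconds(dirty_bytes, block_size, assumed_write_rate)
print(f"{dirty_bytes / 10**9:.1f}GB backlog drains in about {estimate:.1f} s")
```

Under those assumptions the backlog takes roughly 20 seconds to clear, 
which is the same order as the stalls described above.  A slower assumed 
write rate stretches the estimate proportionally.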

Write barriers were supposed to improve all this when added to ext3, but 
they just never seemed to work right for many people.  After reading 
that lkml thread, among others, I know I was left not trusting anything 
beyond the simplest path through this area of the filesystem.  Slow is 
better than corrupted.

So the good news I was relaying is that it looks like this finally works 
on ext4, giving it the behavior you described and expected, which hasn't 
actually been there until now.  I was hoping someone with more free 
time than me might be interested to go investigate further if I pointed 
the advance out.  I'm stuck with too many production systems to play 
with new kernels at the moment, but am quite curious.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com  www.2ndQuadrant.com

