Re: ext4 finally doing the right thing

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	Greg Stark <stark(at)mit(dot)edu>
Cc:	pgsql-performance(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject:	Re: ext4 finally doing the right thing
Date:	2010-01-21 05:58:13
Message-ID:	4B57ECF5.7050502@2ndquadrant.com
Lists:	pgsql-performance

Greg Stark wrote:
>
> That doesn't sound right. The kernel having 10% of memory dirty
> doesn't mean there's a queue you have to jump at all. You don't get
> into any queue until the kernel initiates write-out which will be
> based on the usage counters -- basically a lru. fsync and cousins like
> sync_file_range and posix_fadvise(DONT_NEED) initiate write-out
> right away.
>

Most of the safe ways ext3 knows to initiate a write-out on something that
must go (because it's gotten an fsync on data there) require flushing
every outstanding write to that filesystem along with it. So as soon as
a single WAL write shows up, bam! The whole cache is emptied (or at
least everything associated with that filesystem), and the caller who
asked for that little write is stuck waiting for everything to clear
before their fsync returns success.
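
For anyone who wants to see this effect on their own hardware, here is a
minimal sketch in Python, not a benchmark. It dirties the page cache with
un-synced writes to one file, then times the fsync of a single 8 kB write
to a second file on the same filesystem. The function name, file names,
and sizes are made up for illustration, and the bulk writes here are
sequential rather than the random I/O a checkpoint produces, so treat the
numbers as a rough indication only.

```python
import os
import tempfile
import time


def time_small_fsync(dirty_bytes, directory=None, chunk=1 << 20):
    """Return seconds spent in fsync of a tiny write while other data is dirty.

    directory should sit on the filesystem under test; the default temp
    location may be a tmpfs, where fsync is nearly free.
    """
    with tempfile.TemporaryDirectory(dir=directory) as workdir:
        bulk_path = os.path.join(workdir, "bulk.dat")  # hypothetical name
        wal_path = os.path.join(workdir, "wal.dat")    # hypothetical name

        with open(bulk_path, "wb") as bulk:
            # Build up dirty pages: write without fsync so the data stays
            # in the kernel's cache until something forces it out.
            remaining = dirty_bytes
            while remaining > 0:
                n = min(chunk, remaining)
                bulk.write(b"\0" * n)
                remaining -= n
            bulk.flush()  # hand the data to the kernel, but do not fsync

            # The "little write": one 8 kB block, then time its fsync.
            with open(wal_path, "wb") as wal:
                wal.write(b"x" * 8192)
                wal.flush()
                start = time.perf_counter()
                os.fsync(wal.fileno())
                return time.perf_counter() - start
```

Calling time_small_fsync(1 << 30, directory="/path/on/ext3") and then
again on an ext4 mount would be one way to compare the two; the path is a
placeholder.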

This particular issue absolutely killed Firefox when they switched to
using SQLite not too long ago; high-level discussion at
http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/ and
confirmation/discussion of the issue on lkml at
https://kerneltrap.org/mailarchive/linux-fsdevel/2008/5/26/1941354 .

Note the comment from the first article saying "those delays can be 30
seconds or more". On multiple occasions, I've measured systems with
dozens of disks in a high-performance RAID1+0 with battery-backed
controller that could grind to a halt for 10, 20, or more seconds in
this situation, when running pgbench on a big database. As was the case
on the latest one I saw, if you've got 32GB of RAM and have let 3.2GB of
random I/O from background writer/checkpoint writes back up because
Linux has been lazy about getting to them, that takes a while to clear
no matter how good the underlying hardware.
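
The arithmetic behind that figure can be sketched as follows. The 10%
dirty fraction and 32GB of RAM come from the discussion above; the 8 kB
page size is the PostgreSQL default; the random-write IOPS value is
something you would have to measure on your own array, and the one-I/O-
per-page assumption is a pessimistic simplification, since the kernel
sorts and merges writes where it can.

```python
def dirty_backlog_bytes(ram_gib, dirty_fraction):
    """Bytes of dirty data allowed to accumulate before write-out starts."""
    return ram_gib * 1024**3 * dirty_fraction


def drain_seconds(backlog_bytes, page_bytes, random_write_iops):
    """Rough time to clear the backlog if every page is its own random write.

    random_write_iops is an assumed, measured-elsewhere figure for the
    storage in question; this is an upper-bound style estimate.
    """
    pages = backlog_bytes / page_bytes
    return pages / random_write_iops
```

Plugging in dirty_backlog_bytes(32, 0.10) gives the 3.2GB backlog, and
drain_seconds() then shows how even an array that sustains thousands of
random writes per second needs many seconds to clear it.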

Write barriers were supposed to improve all this when added to ext3, but
they just never seemed to work right for many people. After reading
that lkml thread, among others, I know I was left not trusting anything
beyond the simplest path through this area of the filesystem. Slow is
better than corrupted.

So the good news I was relaying is that it looks like this finally works
on ext4, giving it the behavior you described and expected, which hasn't
actually been there until now. I was hoping someone with more free
time than me might be interested to go investigate further if I pointed
the advance out. I'm stuck with too many production systems to play
with new kernels at the moment, but am quite curious.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com

In response to

Re: ext4 finally doing the right thing at 2010-01-21 05:15:40 from Greg Stark

Responses

Re: ext4 finally doing the right thing at 2010-01-21 11:13:42 from Greg Stark
Re: ext4 finally doing the right thing at 2010-01-21 13:51:29 from Aidan Van Dyk
Re: ext4 finally doing the right thing at 2010-01-21 14:04:25 from Florian Weimer