Re: ext4 finally doing the right thing

From: Greg Stark <stark(at)mit(dot)edu>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-21 11:13:42
Message-ID: 407d949e1001210313w1668d7e2jaee3b4d7984a059@mail.gmail.com
Lists: pgsql-performance

Both of those refer to the *drive* cache.

greg

On 21 Jan 2010 05:58, "Greg Smith" <greg(at)2ndquadrant(dot)com> wrote:

Greg Stark wrote:
> That doesn't sound right. The kernel having 10% of memory dirty
> doesn't mean...
Most of the safe ways ext3 knows to initiate a write-out on something that
must go (because it's gotten an fsync on data there) require flushing every
outstanding write to that filesystem along with it. So as soon as a single
WAL write shows up, bam! The whole cache is emptied (or at least everything
associated with that filesystem), and the caller who asked for that little
write is stuck waiting for everything to clear before their fsync returns
success.
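
One rough way to observe this stall is to time a small fsync before and after a pile of unsynced writes has been queued on the same filesystem. The following is only a sketch under stated assumptions, not something from the original thread: the file names and the 64MB backlog size are hypothetical, and whether the second fsync is noticeably slower depends on the filesystem, mount options, and kernel in use.

```python
import os
import tempfile
import time


def timed_small_fsync(path, payload=b"x" * 8192):
    """Append payload to path, fsync it, and return the fsync time in seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        start = time.perf_counter()
        os.fsync(fd)  # blocks until the kernel reports the data as durable
        return time.perf_counter() - start
    finally:
        os.close(fd)


def dirty_background_data(path, total_bytes, chunk=1 << 20):
    """Write total_bytes to path without fsync, leaving it in the page cache."""
    block = b"\0" * chunk
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            f.write(block)
            written += chunk


if __name__ == "__main__":
    # Both files live in the same temporary directory, hence (normally)
    # on the same filesystem. Names and sizes are hypothetical.
    with tempfile.TemporaryDirectory() as d:
        wal_like = os.path.join(d, "small_wal_like_file")
        bulk = os.path.join(d, "bulk_data_file")

        before = timed_small_fsync(wal_like)
        dirty_background_data(bulk, 64 * (1 << 20))
        after = timed_small_fsync(wal_like)

        print(f"fsync with little dirty data: {before * 1000:.2f} ms")
        print(f"fsync with backlog pending:   {after * 1000:.2f} ms")
```

To approximate the case described here, the temporary directory would need to sit on the filesystem under test, and the backlog would need to be far larger than 64MB.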

This particular issue absolutely killed Firefox when they switched to using
SQLite not too long ago; high-level discussion at
http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/ and
confirmation/discussion of the issue on lkml at
https://kerneltrap.org/mailarchive/linux-fsdevel/2008/5/26/1941354 .
Note the comment from the first article saying "those delays can be 30
seconds or more". On multiple occasions, I've measured systems with dozens
of disks in a high-performance RAID1+0 with battery-backed controller that
could grind to a halt for 10, 20, or more seconds in this situation, when
running pgbench on a big database. As was the case on the latest one I saw,
if you've got 32GB of RAM and have let 3.2GB of random I/O from background
writer/checkpoint writes back up because Linux has been lazy about getting
to them, that takes a while to clear no matter how good the underlying
hardware.
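
As a back-of-the-envelope check on those numbers, the backlog can be converted into a flush time. This is a sketch with loudly assumed inputs: the 8kB block size is PostgreSQL's default, but the random-write rate of 20,000 blocks per second is purely hypothetical, and the model assumes every block is written separately with no merging or reordering by the kernel or controller.

```python
def dirty_backlog_bytes(ram_bytes, dirty_fraction):
    """Bytes of dirty data allowed to accumulate at a given fraction of RAM."""
    return ram_bytes * dirty_fraction


def seconds_to_flush(dirty_bytes, block_bytes, random_writes_per_sec):
    """Time to clear the backlog if each block is one random write."""
    blocks = dirty_bytes / block_bytes
    return blocks / random_writes_per_sec


if __name__ == "__main__":
    GIB = 1024 ** 3
    backlog = dirty_backlog_bytes(32 * GIB, 0.10)  # 10% of 32GB, about 3.2GB
    # Hypothetical array throughput; substitute a measured figure.
    assumed_rate = 20_000
    estimate = seconds_to_flush(backlog, 8192, assumed_rate)
    print(f"backlog: {backlog / GIB:.1f} GiB, estimated flush: {estimate:.1f} s")
```

Under those assumptions the estimate comes out at roughly 21 seconds, which is in the same range as the stalls described above; a slower array or a larger backlog scales the figure linearly.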

Write barriers were supposed to improve all this when added to ext3, but
they just never seemed to work right for many people. After reading that
lkml thread, among others, I know I was left not trusting anything beyond
the simplest path through this area of the filesystem. Slow is better than
corrupted.

So the good news I was relaying is that it looks like this finally works on
ext4, giving it the behavior you described and expected, which hasn't
actually been there until now. I was hoping someone with more free time
than me might be interested to go investigate further if I pointed the
advance out. I'm stuck with too many production systems to play with new
kernels at the moment, but am quite curious.

--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQu(dot)(dot)(dot)
