Re: io storm on checkpoints, postgresql 8.2.4, linux

From: Decibel! <decibel(at)decibel(dot)org>
To: Erik Jones <erik(at)myemma(dot)com>
Cc: Kenneth Marshall <ktm(at)rice(dot)edu>,Dmitry Potapov <fortune(dot)fish(at)gmail(dot)com>,pgsql-performance(at)postgresql(dot)org
Subject: Re: io storm on checkpoints, postgresql 8.2.4, linux
Date: 2007-08-28 21:34:04
Message-ID: 20070828213404.GH1386@nasby.net
Lists: pgsql-performance
On Tue, Aug 28, 2007 at 10:00:57AM -0500, Erik Jones wrote:
> >>    It seemed strange to me that our 70%-read db generates so many dirty
> >>pages that writing them out takes 4-8 seconds and grabs the full bandwidth.
> >>First, I started to tune bgwriter to more aggressive settings, but this
> >>was of no help, with nearly no performance change at all. Digging into the
> >>issue further, I discovered that the Linux page cache was the reason. The
> >>"Dirty" parameter in /proc/meminfo (which shows the amount of ready-to-write
> >>"dirty" data currently sitting in the page cache) grows between checkpoints
> >>from 0 to about 100Mb. When a checkpoint comes, all 100Mb gets flushed out
> >>to disk, effectively causing an IO storm.
> >>
> >>    I found this document
> >>(http://www.westnet.com/~gsmith/content/linux-pdflush.htm)
> >>and peeked into mm/page-writeback.c in the Linux kernel source tree. I'm not
> >>sure that I understand pdflush writeout semantics correctly, but it looks
> >>like when the amount of "dirty" data is less than
> >>dirty_background_ratio*RAM/100, pdflush only writes pages in the background,
> >>waking up every dirty_writeback_centisecs and writing no more than 1024 pages
> >>(MAX_WRITEBACK_PAGES constant). When we hit dirty_background_ratio, pdflush
> >>starts to write out more aggressively.
> >>
> >>    So, it looks like the following scenario takes place: postgresql
> >>constantly writes something to database and xlog files, dirty data gets to
> >>the page cache, and is then slowly written out by pdflush. When postgres
> >>generates more dirty pages than pdflush writes out, the amount of dirty data
> >>in the pagecache grows. When we're at a checkpoint, postgres does fsync() on
> >>the database files, and sleeps until the whole page cache is written out.
> >>
> >>    By default, dirty_background_ratio is 2%, which is about 328Mb of 16Gb
> >>total. Following the current pdflush logic, this is nearly the amount of
> >>data we face writing out at a checkpoint, effectively stalling everything
> >>else, so even 1% of 16Gb is too much. My setup experiences a 4-8 sec pause
> >>in operation even with ~100Mb of dirty pagecache...
> >>
> >>    I temporarily solved this problem by setting dirty_background_ratio to
> >>0%. This causes the dirty data to be written out immediately. It is ok for
> >>our setup (mostly because of the large controller cache), but it doesn't
> >>look to me like an elegant solution. Is there some other way to fix this
> >>issue without disabling the pagecache and the IO smoothing it was designed
> >>to perform?
> >
> >You are working at the correct level. The bgwriter performs the I/O smoothing
> >function at the database level. Obviously, the OS level smoothing function
> >needed to be tuned and you have done that within the parameters of the OS.
> >You may want to bring this up on the Linux kernel lists and see if they have
> >any ideas.
> >
> >Good luck,
> >
> >Ken
> 
> Have you tried decreasing your checkpoint interval?  That would at
> least help to reduce the amount of data that needs to be flushed when
> Postgres fsyncs.

The downside to that is it will result in writing a lot more data to WAL
as long as full page writes are on.

Isn't there some kind of a timeout parameter for how long dirty data
will sit in the cache? It seems pretty broken to me to allow stuff to
sit in a dirty state indefinitely.
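
For anyone who wants to watch this on their own box, here is a rough Python
sketch (not taken from the thread, and the helper names are made up for
illustration). It parses the "Dirty" figure from /proc/meminfo, applies the
dirty_background_ratio*RAM/100 approximation from the quoted message (the
kernel's real calculation may use a different memory base), and reads the
vm knobs under /proc/sys/vm. As far as I know, dirty_expire_centisecs is the
age-based knob in question: it sets how old dirty data must be before pdflush
considers it for writeout.

```python
from pathlib import Path


def parse_meminfo_kb(text, field):
    """Return the value in kB of one /proc/meminfo field, e.g. 'Dirty'."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            return int(rest.split()[0])
    raise KeyError(field)


def background_threshold_kb(total_kb, ratio_percent):
    """Approximation from the quoted message: ratio * RAM / 100."""
    return total_kb * ratio_percent // 100


def read_vm_knob(name, root="/proc/sys/vm"):
    """Read one integer tunable such as dirty_background_ratio."""
    return int(Path(root, name).read_text().split()[0])


def report():
    """Print current dirty data next to the relevant thresholds (Linux only)."""
    meminfo = Path("/proc/meminfo").read_text()
    dirty_kb = parse_meminfo_kb(meminfo, "Dirty")
    total_kb = parse_meminfo_kb(meminfo, "MemTotal")
    ratio = read_vm_knob("dirty_background_ratio")
    expire_cs = read_vm_knob("dirty_expire_centisecs")
    print("Dirty: %d kB" % dirty_kb)
    print("Approx. background threshold: %d kB"
          % background_threshold_kb(total_kb, ratio))
    print("dirty_expire_centisecs: %d (%.1f s)" % (expire_cs, expire_cs / 100.0))
```

Calling report() in a loop between checkpoints should show whether Dirty
climbs the way the original poster describes. With 16Gb of RAM and a 2%
ratio, background_threshold_kb() gives about 328Mb, which matches the figure
quoted above.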
-- 
Decibel!, aka Jim Nasby                        decibel(at)decibel(dot)org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
