On Tue, Aug 28, 2007 at 10:00:57AM -0500, Erik Jones wrote:
> >> It seemed strange to me that our 70%-read db generates so many
> >>pages that writing them out takes 4-8 seconds and grabs the full
> >>First, I started to tune bgwriter to more aggressive settings, but this
> >>was of no help: there were nearly no performance changes at all. Digging
> >>into the issue further, I discovered that the Linux page cache was the
> >>reason. The "Dirty" parameter in /proc/meminfo (which shows the amount of
> >>ready-to-write "dirty" data currently sitting in the page cache) grows
> >>between checkpoints from 0 to about 100MB. When a checkpoint comes, all
> >>of that 100MB gets flushed out to disk, effectively causing an I/O storm.
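The "Dirty" counter mentioned above can also be sampled programmatically. A minimal Python sketch, assuming only the standard /proc/meminfo line format (`Dirty:   <n> kB`); `dirty_kib` and `read_dirty_kib` are hypothetical helper names, and `watch -n 1 grep Dirty /proc/meminfo` shows the same number interactively.

```python
# Minimal sketch (not from the thread): pull the "Dirty" figure out of
# /proc/meminfo-formatted text. Linux reports the value in kB.


def dirty_kib(meminfo_text: str) -> int:
    """Return the Dirty value, in kB, from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key.strip() == "Dirty":
            # rest looks like "     102400 kB"; the first field is the number
            return int(rest.split()[0])
    raise KeyError("no Dirty line found")


def read_dirty_kib(path: str = "/proc/meminfo") -> int:
    """Read the live value; only meaningful on a Linux host."""
    with open(path) as f:
        return dirty_kib(f.read())


if __name__ == "__main__":
    sample = "MemTotal:       16438272 kB\nDirty:            102400 kB\n"
    print(dirty_kib(sample))  # 102400 kB, i.e. 100 MiB
```

Logging `read_dirty_kib()` once a second across a checkpoint should make the 0 to ~100MB sawtooth described above directly visible.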
> >> I found this (http://www.westnet.com/~gsmith/content/linux-
> >>document and peeked into mm/page-writeback.c in the Linux kernel source
> >>tree. I'm not sure that I understand the pdflush writeout semantics
> >>correctly, but it looks like when the amount of "dirty" data is less than
> >>dirty_background_ratio*RAM/100, pdflush only writes pages in the
> >>background, waking up every dirty_writeback_centisecs and writing no more
> >>than 1024 pages (the MAX_WRITEBACK_PAGES constant). When we hit
> >>dirty_background_ratio, pdflush starts to write out more aggressively.
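The two writeout regimes described above can be written down as a toy model. This Python sketch follows the poster's reading, not the actual kernel logic: `pages_to_write` is a hypothetical name, "more aggressively" is modelled as draining back down to the threshold (an assumption), and only the 1024-page cap comes from the message.

```python
# Toy model of the pdflush policy as described in the message above;
# it is not kernel code and ignores everything except the two regimes.
MAX_WRITEBACK_PAGES = 1024  # per-wakeup cap quoted from mm/page-writeback.c


def pages_to_write(dirty_pages: int, total_pages: int, background_ratio: int) -> int:
    """Pages written on one pdflush wakeup under the simplified policy."""
    threshold = total_pages * background_ratio // 100
    if dirty_pages < threshold:
        # Background trickle: at most one batch per wakeup.
        return min(dirty_pages, MAX_WRITEBACK_PAGES)
    # "More aggressive" mode, assumed here to drain back down to the
    # threshold, but never less than one normal batch.
    return min(dirty_pages, max(dirty_pages - threshold, MAX_WRITEBACK_PAGES))


if __name__ == "__main__":
    total = 4 * 1024 * 1024  # 16 GiB of RAM in 4 KiB pages
    print(pages_to_write(20_000, total, 2))   # below threshold: 1024
    print(pages_to_write(100_000, total, 2))  # above threshold: 16114
    print(pages_to_write(20_000, total, 0))   # ratio 0: 20000, all at once
```

With `background_ratio` set to 0 the threshold is zero, so the model writes every dirty page on each wakeup, which is consistent with the effect the poster reports for dirty_background_ratio=0.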
> >> So it looks like the following scenario takes place: PostgreSQL writes
> >>something to the database and xlog files, the dirty data goes into the
> >>page cache, and it is then slowly written out by pdflush. When PostgreSQL
> >>generates more dirty pages than pdflush writes out, the amount of dirty
> >>data in the page cache grows. At checkpoint time, PostgreSQL calls
> >>fsync() on the database files and sleeps until the whole page cache is
> >>written out.
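The scenario above boils down to a rate balance between how fast PostgreSQL dirties pages and how fast pdflush trickles them out. A rough Python sketch under stated assumptions: 4 KiB pages, one 1024-page batch per wakeup (the poster's reading), the kernel's usual dirty_writeback_centisecs default of 500 (5 s), a 300 s interval matching PostgreSQL's default checkpoint_timeout, and a made-up workload rate; `background_drain_bps` and `dirty_backlog_bytes` are hypothetical names.

```python
# Back-of-the-envelope model: how much dirty data piles up between
# checkpoints if pdflush only trickles one batch per wakeup.
PAGE_SIZE = 4096            # bytes; assumes 4 KiB pages (x86)
MAX_WRITEBACK_PAGES = 1024  # per-wakeup cap quoted above
MIB = 1024 * 1024


def background_drain_bps(dirty_writeback_centisecs: int = 500) -> float:
    """Trickle rate in bytes/s if pdflush writes one batch per wakeup."""
    interval_s = dirty_writeback_centisecs / 100
    return MAX_WRITEBACK_PAGES * PAGE_SIZE / interval_s


def dirty_backlog_bytes(dirty_rate_bps: float, interval_s: float,
                        dirty_writeback_centisecs: int = 500) -> float:
    """Dirty data left for the checkpoint's fsync() after interval_s seconds."""
    surplus = dirty_rate_bps - background_drain_bps(dirty_writeback_centisecs)
    return max(0.0, surplus * interval_s)


if __name__ == "__main__":
    print(background_drain_bps() / MIB)               # about 0.8 MiB/s
    # Hypothetical workload dirtying 1.2 MiB/s, 300 s between checkpoints:
    print(dirty_backlog_bytes(1.2 * MIB, 300) / MIB)  # about 120 MiB
```

Because the backlog in this model grows linearly with the interval, cutting the interval from 300 s to 60 s cuts the hypothetical backlog from about 120 MiB to about 24 MiB.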
> >> By default, dirty_background_ratio is 2%, which is about 328MB of 16GB
> >>total. Following the current pdflush logic, nearly this amount of data
> >>has to be written out at checkpoint time, effectively stalling everything
> >>else, so even 1% of 16GB is too much. My setup experiences a 4-8 second
> >>pause in operation even with ~100MB of dirty page cache...
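The figures in the paragraph above can be checked with a few lines of arithmetic. A Python sketch, taking the message's simplification that the ratio applies to total RAM; the function names are hypothetical.

```python
# Arithmetic behind the figures above, assuming (as the message does) that
# dirty_background_ratio is applied to total RAM.


def background_threshold_mib(ram_gib: float, ratio_pct: float) -> float:
    """Dirty data tolerated before aggressive writeback, in MiB."""
    return ram_gib * 1024 * ratio_pct / 100


def implied_flush_mib_per_s(dirty_mib: float, stall_s: float) -> float:
    """Write throughput implied by flushing dirty_mib during a stall."""
    return dirty_mib / stall_s


if __name__ == "__main__":
    print(background_threshold_mib(16, 2))  # 327.68 (2% of 16 GiB)
    print(background_threshold_mib(16, 1))  # 163.84 (1% of 16 GiB)
    print(implied_flush_mib_per_s(100, 8), implied_flush_mib_per_s(100, 4))  # 12.5 25.0
```

A 4-8 second stall on ~100MB corresponds to roughly 12.5-25 MiB/s of effective write throughput during the checkpoint; at that rate the 1% setting (about 164 MiB) would extrapolate to a pause of roughly 6.5-13 seconds.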
> >> I temporarily solved this problem by setting dirty_background_ratio to
> >>0%. This causes the dirty data to be written out immediately. It is OK
> >>for our setup (mostly because of the large controller cache), but it
> >>doesn't look to me like an elegant solution. Is there some other way to
> >>fix this without disabling the page cache and the I/O smoothing it was
> >>designed to perform?
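The workaround above is a single kernel setting, normally applied with `sysctl -w vm.dirty_background_ratio=0` or made persistent with a `vm.dirty_background_ratio = 0` line in /etc/sysctl.conf. A minimal Python equivalent that writes the /proc/sys file directly; the helper names are hypothetical, and the demo targets a scratch directory so it can run without root.

```python
# Sketch of applying the workaround from Python. Writing the /proc/sys file
# is what `sysctl -w vm.dirty_background_ratio=0` does; it needs root and
# does not survive a reboot (use /etc/sysctl.conf for that).
import tempfile
from pathlib import Path


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file, e.g. vm.x -> /proc/sys/vm/x."""
    return Path(root).joinpath(*name.split("."))


def set_sysctl(name: str, value: int, root: str = "/proc/sys") -> None:
    """Write an integer value to the sysctl file."""
    sysctl_path(name, root).write_text(f"{value}\n")


def get_sysctl(name: str, root: str = "/proc/sys") -> int:
    """Read back the first integer stored in the sysctl file."""
    return int(sysctl_path(name, root).read_text().split()[0])


if __name__ == "__main__":
    # Dry run against a scratch directory instead of the real /proc/sys.
    with tempfile.TemporaryDirectory() as fake_root:
        (Path(fake_root) / "vm").mkdir()
        set_sysctl("vm.dirty_background_ratio", 0, root=fake_root)
        print(get_sysctl("vm.dirty_background_ratio", root=fake_root))  # 0
```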
> >You are working at the correct level. The bgwriter performs the I/O
> >function at the database level. Obviously, the OS level smoothing
> >needed to be tuned and you have done that within the parameters of
> >the OS.
> >You may want to bring this up on the Linux kernel lists and see if they
> >have any ideas.
> >Good luck,
> Have you tried decreasing your checkpoint interval? That would at
> least help to reduce the amount of data that needs to be flushed when
> Postgres fsyncs.
The downside to that is it will result in writing a lot more data to WAL
as long as full page writes are on.
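The trade-off described just above can be quantified with a toy counter. With full_page_writes on, PostgreSQL writes a complete copy of a page to WAL the first time that page is modified after a checkpoint, so more frequent checkpoints produce more page images for the same workload. A Python sketch under simplifying assumptions: checkpoints exactly every interval, the default 8 KB block size, and a hypothetical workload and helper name.

```python
# Illustration of the WAL cost of frequent checkpoints with full_page_writes
# on: the first change to a page after each checkpoint logs the whole page.
BLCKSZ = 8192  # PostgreSQL's default block size, in bytes


def full_page_image_bytes(modifications, checkpoint_interval_s: float) -> int:
    """Bytes of full-page images for (time_s, page_id) pairs sorted by time.

    Simplification: checkpoints happen exactly at multiples of the interval,
    and only the images are counted, not the ordinary WAL records.
    """
    current_cycle = None
    imaged = set()
    images = 0
    for t, page in modifications:
        cycle = int(t // checkpoint_interval_s)  # checkpoint interval t falls in
        if cycle != current_cycle:
            current_cycle, imaged = cycle, set()  # new checkpoint: start over
        if page not in imaged:
            imaged.add(page)
            images += 1
    return images * BLCKSZ


if __name__ == "__main__":
    # Hypothetical workload: one hot page updated once a second for 10 minutes.
    hot = [(t, 42) for t in range(600)]
    print(full_page_image_bytes(hot, 300))  # 2 images  -> 16384
    print(full_page_image_bytes(hot, 60))   # 10 images -> 81920
```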
Isn't there some kind of a timeout parameter for how long dirty data
will sit in the cache? It seems pretty broken to me to allow stuff to
sit in a dirty state indefinitely.
Decibel!, aka Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
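For reference, Linux does have an age-based knob of the kind asked about above: vm.dirty_expire_centisecs (default 3000, i.e. 30 seconds) sets how long data may stay dirty before the periodic writeback treats it as old enough to write. A minimal Python model of that eligibility rule, with hypothetical names; it only decides which pages qualify, so by itself it says nothing about whether writeback keeps pace with the rate at which pages are dirtied.

```python
# Simplified model of age-based writeback: a dirty page becomes eligible
# for the periodic flush once it has been dirty longer than
# dirty_expire_centisecs. It says nothing about how fast the flush runs.


def expired_pages(dirtied_at: dict, now_s: float,
                  dirty_expire_centisecs: int = 3000) -> list:
    """Page ids dirty for longer than the expiry age (3000 cs = 30 s)."""
    max_age_s = dirty_expire_centisecs / 100
    return sorted(page for page, t in dirtied_at.items() if now_s - t > max_age_s)


if __name__ == "__main__":
    dirty = {1: 0.0, 2: 25.0, 3: 31.0}  # page id -> time first dirtied (s)
    print(expired_pages(dirty, now_s=60.0))  # [1, 2]
```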