Re: io storm on checkpoints, postgresql 8.2.4, linux

From: Decibel! <decibel(at)decibel(dot)org>
To: Erik Jones <erik(at)myemma(dot)com>
Cc: Kenneth Marshall <ktm(at)rice(dot)edu>, Dmitry Potapov <fortune(dot)fish(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: io storm on checkpoints, postgresql 8.2.4, linux
Date: 2007-08-28 21:34:04
Message-ID: 20070828213404.GH1386@nasby.net
Lists: pgsql-performance

On Tue, Aug 28, 2007 at 10:00:57AM -0500, Erik Jones wrote:
> >> It seemed strange to me that our 70%-read db generates so many dirty
> >> pages that writing them out takes 4-8 seconds and grabs the full
> >> bandwidth. First, I started to tune bgwriter to more aggressive
> >> settings, but this was of no help, with nearly no performance change
> >> at all. Digging into the issue further, I discovered that the Linux
> >> page cache was the reason. The "Dirty" parameter in /proc/meminfo
> >> (which shows the amount of ready-to-write "dirty" data currently
> >> sitting in the page cache) grows between checkpoints from 0 to about
> >> 100MB. When a checkpoint comes, all 100MB gets flushed out to disk,
> >> effectively causing an IO storm.
> >>
> >> I found this document
> >> (http://www.westnet.com/~gsmith/content/linux-pdflush.htm) and
> >> peeked into mm/page-writeback.c in the Linux kernel source tree. I'm
> >> not sure that I understand pdflush writeout semantics correctly, but
> >> it looks like when the amount of "dirty" data is less than
> >> dirty_background_ratio*RAM/100, pdflush only writes pages in the
> >> background, waking up every dirty_writeback_centisecs and writing no
> >> more than 1024 pages (the MAX_WRITEBACK_PAGES constant). When we hit
> >> dirty_background_ratio, pdflush starts to write out more aggressively.
> >>
> >> So, it looks like the following scenario takes place: postgresql
> >> constantly writes something to the database and xlog files, the dirty
> >> data goes to the page cache and is then slowly written out by pdflush.
> >> When postgres generates more dirty pages than pdflush writes out, the
> >> amount of dirty data in the page cache grows. At checkpoint time,
> >> postgres does fsync() on the database files and sleeps until the
> >> whole page cache is written out.
> >>
> >> By default, dirty_background_ratio is 2%, which is about 328MB of
> >> 16GB total. Following the current pdflush logic, this is nearly the
> >> amount of data we face writing out at checkpoint, effectively stalling
> >> everything else, so even 1% of 16GB is too much. My setup experiences
> >> a 4-8 second pause in operation even with ~100MB of dirty page cache...
> >>
> >> I temporarily solved this problem by setting dirty_background_ratio
> >> to 0%. This causes the dirty data to be written out immediately. It
> >> is OK for our setup (mostly because of the large controller cache),
> >> but it doesn't look to me like an elegant solution. Is there some
> >> other way to fix this issue without disabling the page cache and the
> >> IO smoothing it was designed to perform?
> >
> >You are working at the correct level. The bgwriter performs the I/O
> >smoothing function at the database level. Obviously, the OS level
> >smoothing function needed to be tuned and you have done that within
> >the parameters of the OS. You may want to bring this up on the Linux
> >kernel lists and see if they have any ideas.
> >
> >Good luck,
> >
> >Ken
>
> Have you tried decreasing your checkpoint interval? That would at
> least help to reduce the amount of data that needs to be flushed when
> Postgres fsyncs.

The downside to that is it will result in writing a lot more data to WAL
as long as full page writes are on.

Isn't there some kind of a timeout parameter for how long dirty data
will sit in the cache? It seems pretty broken to me to allow stuff to
sit in a dirty state indefinitely.
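
For anyone wanting to check the numbers on their own box, here is a
rough Python sketch. It assumes the usual Linux /proc layout, and it
assumes that vm.dirty_expire_centisecs is the timeout in question (it is
the kernel's age limit for dirty pages, but whether it fully explains
the behaviour described above is not established in this thread). The
function names are made up for illustration, and the threshold is only
the dirty_background_ratio*RAM/100 approximation from the quoted text,
not the kernel's exact calculation.

```python
from pathlib import Path

# Assumption: standard Linux location for the vm.* tunables.
VM_DIR = Path("/proc/sys/vm")


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: bytes} dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        value = int(parts[0])
        # Most fields carry a "kB" suffix, which means 1024 bytes.
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        fields[name.strip()] = value
    return fields


def background_threshold_bytes(total_bytes, dirty_background_ratio):
    """Approximate dirty_background_ratio * RAM / 100 (integer bytes)."""
    return total_bytes * dirty_background_ratio // 100


def read_vm_setting(name, vm_dir=VM_DIR):
    """Read an integer vm.* tunable, e.g. 'dirty_expire_centisecs'."""
    return int((Path(vm_dir) / name).read_text().strip())
```

Feeding the contents of /proc/meminfo to parse_meminfo() and comparing
the "Dirty" field against background_threshold_bytes() for "MemTotal"
shows how close the box is to the point where background writeout kicks
in; read_vm_setting("dirty_expire_centisecs") reports the age limit in
hundredths of a second.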
--
Decibel!, aka Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
