On Thu, Feb 23, 2012 at 11:17 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> A second fact that's visible from the TPS graphs over the test run, and
> obvious if you think about it, is that BGW writes force data to physical
> disk earlier than they otherwise might go there. That's a subtle pattern in
> the graphs. I expect that though, given one element to "do I write this?"
> in Linux is how old the write is. Wondering about this really emphasises
> that I need to either add graphing of vmstat/iostat data to these graphs or
> switch to a benchmark that does that already. I think I've got just enough
> people using pgbench-tools to justify the feature even if I plan to use the
> program less.
For me, that is the key point.
For the test being performed there is no value in things being written
earlier, since doing so merely overexercises the I/O.
We should note that there is no feedback process in the bgwriter to do
writes only when the level of dirty writes by backends is high enough
to warrant the activity. Note that Linux has a demand paging
algorithm; it doesn't just clean all of the time. That's the reason
you still see some swapping: that activity is what wakes the
pager. We don't count the number of dirty writes by backends; we just
keep cleaning even when nobody wants it.
Earlier, I pointed out that bgwriter is being woken any time a user
marks a buffer dirty. That is overkill. The bgwriter should stay
asleep until a threshold number (TBD) of dirty writes is reached, then
it should wake up and do some cleaning. Having a continuously active
bgwriter is pointless for some workloads, whereas for others it
helps. So a sleeping bgwriter isn't just a power management issue;
it's a performance issue in some cases. The current code comments
read:
* Even in cases where there's been little or no buffer allocation
* activity, we want to make a small amount of progress through the buffer
* cache so that as many reusable buffers as possible are clean after an
* idle period.
* (scan_whole_pool_milliseconds / BgWriterDelay) computes how many times
* the BGW will be called during the scan_whole_pool time; slice the
* buffer pool into that many sections.
Since scan_whole_pool_milliseconds is set to 2 minutes, we scan the
whole buffer pool every 2 minutes, no matter how big the buffer pool
is, even when nothing else is happening. Not cool.
I think it would be sensible to have bgwriter stop when 10% of
shared_buffers are clean, rather than keep going even when no dirty
writes are happening.
So my suggestion is that we put in an additional clause into
BgBufferSync() to allow min_scan_buffers to fall to zero when X% of
shared buffers is clean. After that bgwriter should sleep. And be
woken again only by a dirty write from a user backend. That sounds as
if the clean ratio will flip between 0 and X%, but the first dirty
write will occur long before we hit zero, so that will cause bgwriter
to maintain a reasonably steady-state clean ratio.
I would also take a wild guess that the 750 results are due to
freelist contention. To assess that, I post again the patch shown on
other threads designed to measure the overall level of freelist
lwlock contention.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services