Re: Bgwriter strategies

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 09:55:28
Message-ID: 468E1190.8050902@enterprisedb.com
Lists: pgsql-hackers

Greg Smith wrote:
> On Thu, 5 Jul 2007, Heikki Linnakangas wrote:
>
>> It looks like Tom's idea is not a winner; it leads to more writes than
>> necessary.
>
> What I came away with as the core of Tom's idea is that the cleaning/LRU
> writer shouldn't ever scan the same section of the buffer cache twice,
> because anything that resulted in a new dirty buffer will be unwritable
> by it until the clock sweep passes over it. I never took that to mean
> that idea necessarily had to be implemented as "trying to aggressively
> keep all pages with usage_count=0 clean".
>
> I've been making slow progress on this myself, and the question I've
> been trying to answer is whether this fundamental idea really matters or
> not. One clear benefit that alternate implementation should allow is
> setting a lower value for the interval without being as concerned that
> you're wasting resources by doing so, which I've found to be a problem
> with the current implementation--it will consume a lot of CPU scanning
> the same section right now if you lower that too much.

Yes. Ignoring the CPU overhead of scanning the same section over and
over again, Tom's proposal is the same as setting both bgwriter_lru_*
settings all the way up to the max. I ran a DBT-2 test like that as
well, and the # of writes was indeed the same, just with higher CPU
usage. It's clear that scanning the same section over and over again has
been a waste of time in previous releases.
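
By "all the way up to the max" I mean roughly these postgresql.conf
settings; the values are just the documented upper limits of the two
GUCs:

  bgwriter_lru_percent  = 100     # scan up to the whole pool ahead of the clock hand each round
  bgwriter_lru_maxpages = 1000    # write up to 1000 buffers per round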

As a further data point, I constructed a smaller test case that performs
random DELETEs on a table using an index. I varied shared_buffers, and
ran the test with the bgwriter either disabled or tuned all the way up to
the maximum. Here are the results; a sketch of the test itself follows
the table:

shared_buffers | writes | writes | writes_ratio
----------------+--------+--------+-------------------
2560 | 86936 | 88023 | 1.01250345081439
5120 | 81207 | 84551 | 1.04117871612053
7680 | 75367 | 80603 | 1.06947337694216
10240 | 69772 | 74533 | 1.06823654187926
12800 | 64281 | 69237 | 1.07709898725907
15360 | 58515 | 64735 | 1.10629753054772
17920 | 53231 | 58635 | 1.10151979109917
20480 | 48128 | 54403 | 1.13038148271277
23040 | 43087 | 49949 | 1.15925917330053
25600 | 39062 | 46477 | 1.1898264297783
28160 | 35391 | 43739 | 1.23587917832217
30720 | 32713 | 37480 | 1.14572188426619
33280 | 31634 | 31677 | 1.00135929695897
35840 | 31668 | 31717 | 1.00154730327144
38400 | 31696 | 31693 | 0.999905350832913
40960 | 31685 | 31730 | 1.00142023039293
43520 | 31694 | 31650 | 0.998611724616647
46080 | 31661 | 31650 | 0.999652569407157

The first writes column is the # of writes with the bgwriter disabled,
the second is with the aggressive bgwriter. The table size is 33334
pages, so beyond that point the whole table fits in cache and the
bgwriter strategy makes no difference.
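
For reference, the test is roughly of this shape; the table definition,
row count, and key range below are illustrative, not the exact script I
used:

  CREATE TABLE delete_test (id integer PRIMARY KEY, filler text);
  INSERT INTO delete_test
      SELECT g, repeat('x', 500) FROM generate_series(1, 1000000) g;

  -- The driver loop issues single-row DELETEs with random keys, so each
  -- statement dirties both index and heap pages through an index scan.
  DELETE FROM delete_test WHERE id = (random() * 1000000)::integer;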

> As far as your results, first off I'm really glad to see someone else
> comparing checkpoint/backend/bgwriter writes the same way I've been doing, so
> I finally have someone else's results to compare against. I expect that
> the optimal approach here is a hybrid one that structures scanning the
> buffer cache the new way Tom suggests, but limits the number of writes
> to "just enough". I happen to be fond of the "just enough" computation
> based on a weighted moving average I wrote before, but there's certainly
> room for multiple implementations of that part of the code to evolve.

We need to get the requirements straight.

One goal of the bgwriter is clearly to keep just enough buffers clean in
front of the clock hand so that backends don't need to do writes
themselves until the next bgwriter iteration. But not any more than
that; otherwise we might end up doing more writes than necessary if some
of the buffers are re-dirtied.

To deal with bursty workloads, for example a batch of 2 GB worth of
inserts coming in every 10 minutes, it seems we want to keep doing a
little bit of cleaning even when the system is idle, to prepare for the
next burst. The idea is to smooth out the physical I/O bursts; if we don't
clean the dirty buffers left over from the previous burst during the
idle period, the I/O system will be bottlenecked during the bursts, and
sit idle otherwise.

To strike a balance between cleaning buffers ahead of possible future
bursts and not doing unnecessary I/O when no such bursts come, I think a
reasonable strategy is to write buffers with usage_count=0 at a slow
pace when there are no buffer allocations happening.

To smooth out the small variations in a relatively steady workload, the
weighted moving average sounds good.
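
To illustrate what I have in mind, here's a rough sketch of how a
bgwriter round could pick its write target. The function, variable, and
constant names are made up for illustration; this is not actual backend
code:

/*
 * Sketch only: decide how many buffers to clean ahead of the clock hand
 * this round.  recent_alloc is the number of buffers allocated by
 * backends since the previous round.
 */
static float smoothed_alloc = 0;    /* weighted moving average of allocs/round */

static int
cleaning_target(int recent_alloc)
{
    const float smoothing_samples = 16.0;   /* how much history to keep */
    const int   idle_trickle = 5;           /* buffers to clean per idle round */

    /* weighted moving average: each new sample moves the estimate by 1/N */
    smoothed_alloc += (recent_alloc - smoothed_alloc) / smoothing_samples;

    if (recent_alloc == 0)
        return idle_trickle;    /* idle: trickle out usage_count=0 buffers
                                 * to prepare for the next burst */

    /*
     * Steady state: clean just enough for the expected allocations, with
     * some headroom, but no more than that.
     */
    return (int) (smoothed_alloc * 2.0);
}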

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
