Re: Load Distributed Checkpoints, take 3

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, take 3
Date: 2007-06-26 13:57:39
Message-ID: 46811B53.3040708@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> Greg Smith <gsmith(at)gregsmith(dot)com> writes:
>> The way transitions between completely idle and all-out bursts happen were
>> one problematic area I struggled with. Since the LRU point doesn't move
>> during the idle parts, and the lingering buffers have a usage_count>0, the
>> LRU scan won't touch them; the only way to clear out a bunch of dirty
>> buffers leftover from the last burst is with the all-scan.
>
> One thing that might be worth changing is that right now, BgBufferSync
> starts over from the current clock-sweep point on each call --- that is,
> each bgwriter cycle. So it can't really be made to write very many
> buffers without excessive CPU work. Maybe we should redefine it to have
> some static state carried across bgwriter cycles, such that it would
> write at most N dirty buffers per call, but scan through X percent of
> the buffers, possibly across several calls, before returning to the (by
> now probably advanced) clock-sweep point. This would allow a larger
> value of X to be used than is currently practical. You might wish to
> recheck the clock sweep point on each iteration just to make sure the
> scan hasn't fallen behind it, but otherwise I don't see any downside.
> The scenario where somebody re-dirties a buffer that was cleaned by the
> bgwriter scan isn't a problem, because that buffer will also have had its
> usage_count increased and thereby not be a candidate for replacement.

Something along those lines could be useful. I've thought of that
before, but it never occurred to me that if a page in front of the clock
hand is re-dirtied, it's no longer a candidate for replacement anyway...

I'm going to leave the all- and lru- bgwriter scans alone for now to get
this LDC patch finished. We still have the bgwriter autotuning patch in
the queue. Let's think about this in the context of that patch.
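
To make the idea concrete, here's a minimal, self-contained sketch in C of
what carrying scan state across bgwriter cycles could look like. All the
names (bg_buffer_sync, FakeBufferDesc, strategy_point, ...) are made up for
illustration; this is not the actual bufmgr code, and it doesn't model the
clock-sweep hand advancing concurrently:

#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS 1024                   /* stand-in for NBuffers */

typedef struct
{
    bool        dirty;
    int         usage_count;
} FakeBufferDesc;

static FakeBufferDesc buffers[NBUFFERS];    /* stand-in for the buffer pool */
static int  strategy_point = 0;             /* stand-in for the clock-sweep hand */

/* state that persists across bgwriter cycles */
static int  scan_point = -1;                /* where the last scan stopped */
static int  scanned_since_restart = 0;      /* buffers covered since restart */

static void
write_buffer(int buf)
{
    buffers[buf].dirty = false;             /* pretend we flushed it */
}

/*
 * Simplified BgBufferSync: write at most maxwrites dirty buffers per call,
 * but keep scanning from where the previous call stopped, covering up to
 * scan_fraction of the pool (possibly over several calls) before snapping
 * back to the (by now probably advanced) clock-sweep point.
 */
void
bg_buffer_sync(int maxwrites, double scan_fraction)
{
    int         scan_limit = (int) (NBUFFERS * scan_fraction);
    int         written = 0;

    /* restart from the clock-sweep point once a full pass is done */
    if (scan_point < 0 || scanned_since_restart >= scan_limit)
    {
        scan_point = strategy_point;
        scanned_since_restart = 0;
    }

    while (written < maxwrites && scanned_since_restart < scan_limit)
    {
        if (buffers[scan_point].dirty)
        {
            write_buffer(scan_point);
            written++;
        }
        scan_point = (scan_point + 1) % NBUFFERS;
        scanned_since_restart++;
    }
}

int
main(void)
{
    int         i;
    int         ndirty = 0;

    /* dirty a burst of buffers, as after a sudden spike of activity */
    for (i = 0; i < 400; i++)
        buffers[i].dirty = true;

    /* three bgwriter cycles: at most 100 writes each, scanning 50% of pool */
    for (i = 0; i < 3; i++)
        bg_buffer_sync(100, 0.5);

    for (i = 0; i < NBUFFERS; i++)
        if (buffers[i].dirty)
            ndirty++;
    printf("%d buffers still dirty\n", ndirty);
    return 0;
}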

>> As a general comment on this subject, a lot of the work in LDC presumes
>> you have an accurate notion of how close the next checkpoint is.
>
> Yeah; this is one reason I was interested in carrying some write-speed
> state across checkpoints instead of having the calculation start from
> scratch each time. That wouldn't help systems that sit idle a long time
> and suddenly go nuts, but it seems to me that smoothing the write rate
> across more than one checkpoint could help if the fluctuations occur
> over a timescale of a few checkpoints.

Hmm. This problem only applies to checkpoints triggered by
checkpoint_segments; time tends to move forward at a constant rate.

I'm not worried about small fluctuations or bursts. As I argued earlier,
the OS still buffers the writes and should give some extra smoothing of
the physical writes. I believe bursts of, say, 50-100 MB would easily fit
in OS cache, as long as there's enough gap between them. I haven't
tested that, though.

Here's a proposal for an algorithm to smooth bigger bursts:

The basic design is the same as before. We keep track of elapsed time
and elapsed xlogs, and based on them we estimate how much progress in
flushing the buffers we should've made by now, and then we catch up
until we reach that. The estimate for time is the same. The estimate for
xlogs gets more complicated:

Let's have a few definitions first:

Ro = elapsed segments / elapsed time, from previous checkpoint cycle.
For example, 1.25 means that the checkpoint was triggered by
checkpoint_segments, and we had spent 1/1.25 = 80% of
checkpoint_timeout when we reached checkpoint_segments. 0.25 would mean
the checkpoint was triggered by checkpoint_timeout, and we had spent
25% of checkpoint_segments by then.

Rn = elapsed segments / elapsed time so far in the current in-progress
checkpoint.

t = elapsed time, as a fraction of checkpoint_timeout (0.0 - 1.0, though
could be > 1 if next checkpoint is already due)
s = elapsed xlog segments, as a fraction of checkpoint_segments (0.0 -
1.0, though could again be > 1 if next checkpoint is already due)

R = estimate for WAL segment consumption rate, as checkpoint_segments /
checkpoint_timeout

R = Ro * (1 - t) + Rn * t

In other words, at the beginning of the checkpoint, we give more weight
to the state carried over from previous checkpoint. As we go forward,
more weight is given to the rate calculated from current cycle.

From R, we extrapolate how much progress we should've made by now:

Target progress = R * t
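
To illustrate with made-up numbers: if the previous checkpoint was triggered
by checkpoint_segments after 80% of checkpoint_timeout, then Ro = 1/0.8 =
1.25. If in the current cycle we've consumed 20% of checkpoint_segments in
40% of checkpoint_timeout, then Rn = 0.2/0.4 = 0.5 and t = 0.4, so
R = 1.25 * 0.6 + 0.5 * 0.4 = 0.95, and the target progress is
0.95 * 0.4 = 0.38, i.e. about 38% of the checkpoint's writes should be done
by now.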

This would require saving just one number from the previous cycle (its final
Rn, which becomes the next cycle's Ro), and there is no requirement to call
the estimation function at steady time intervals, for example. It gives a
pretty steady I/O rate even if there are big bursts in WAL activity, but
still reacts to changes in the rate.
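
Here's a rough sketch in C of the estimation function, mostly to show how
little state is involved. The function names and the way Ro gets updated are
made up for illustration; this is not the actual LDC patch code:

/*
 * Sketch of the xlog-based progress estimate described above.
 *
 * t = elapsed time, as a fraction of checkpoint_timeout
 * s = elapsed xlog segments, as a fraction of checkpoint_segments
 * Ro is the only state carried over from the previous checkpoint cycle.
 */
static double Ro = 1.0;         /* rate observed in the previous cycle */

/* Returns the progress (0.0 - 1.0, possibly more) we should have made. */
double
xlog_target_progress(double t, double s)
{
    double      Rn;             /* rate observed so far in this cycle */
    double      R;              /* blended rate estimate */

    if (t <= 0.0)
        return 0.0;             /* nothing has elapsed yet */

    Rn = s / t;

    /* early in the cycle trust the previous cycle, later trust this one */
    R = Ro * (1.0 - t) + Rn * t;

    return R * t;
}

/* At the end of each checkpoint cycle, save its rate for the next one. */
void
checkpoint_done(double t, double s)
{
    if (t > 0.0)
        Ro = s / t;
}

With the numbers from the example above, checkpoint_done(0.8, 1.0) sets Ro to
1.25, and xlog_target_progress(0.4, 0.2) then returns 0.38.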

I'm not convinced this is worth the effort, though. First of all, this
is only a problem if you use checkpoint_segments to control your
checkpoints, in which case you can lower checkpoint_timeout to do more
work during the idle periods. Secondly, with the optimization of not flushing
buffers during checkpoint that were dirtied after the start of
checkpoint, the LRU-sweep will also contribute to flushing the buffers
and finishing the checkpoint. We don't count them towards the progress
made ATM, but we probably should. Lastly, distributing the writes even a
little bit is going to be smoother than the current behavior anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
