Re: Load Distributed Checkpoints, take 3

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, take 3
Date: 2007-06-26 13:57:39
Lists: pgsql-patches

Tom Lane wrote:
> Greg Smith <gsmith(at)gregsmith(dot)com> writes:
>> The way transitions between completely idle and all-out bursts happen were 
>> one problematic area I struggled with.  Since the LRU point doesn't move 
>> during the idle parts, and the lingering buffers have a usage_count>0, the 
>> LRU scan won't touch them; the only way to clear out a bunch of dirty 
>> buffers leftover from the last burst is with the all-scan.
> One thing that might be worth changing is that right now, BgBufferSync
> starts over from the current clock-sweep point on each call --- that is,
> each bgwriter cycle.  So it can't really be made to write very many
> buffers without excessive CPU work.  Maybe we should redefine it to have
> some static state carried across bgwriter cycles, such that it would
> write at most N dirty buffers per call, but scan through X percent of
> the buffers, possibly across several calls, before returning to the (by
> now probably advanced) clock-sweep point.  This would allow a larger
> value of X to be used than is currently practical.  You might wish to
> recheck the clock sweep point on each iteration just to make sure the
> scan hasn't fallen behind it, but otherwise I don't see any downside.
> The scenario where somebody re-dirties a buffer that was cleaned by the
> bgwriter scan isn't a problem, because that buffer will also have had its
> usage_count increased and thereby not be a candidate for replacement.

Something along those lines could be useful. I've thought of that 
before, but it never occurred to me that if a page in front of the clock 
hand is re-dirtied, it's no longer a candidate for replacement anyway...
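
To make Tom's suggestion concrete, here is a minimal Python sketch of a cleaning scan that carries its position across calls. Everything in it is hypothetical: ScanState, bg_buffer_sync, and the monotonically increasing clock_pos counter are illustrative names and assumptions, not the actual BgBufferSync code, and usage_count and pinning are ignored to keep it short.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Buffer:
    dirty: bool = False


@dataclass
class ScanState:
    """Hypothetical state carried across bgwriter cycles."""
    scan_pos: int = 0    # next buffer to examine (monotonic counter, not wrapped)
    remaining: int = 0   # buffers still to examine in the current pass


def bg_buffer_sync(buffers: List[Buffer], state: ScanState, clock_pos: int,
                   max_writes: int, scan_fraction: float) -> int:
    """Examine buffers ahead of the clock hand, writing at most max_writes.

    clock_pos is assumed to count every buffer the clock sweep has passed
    since startup, so positions can be compared without wraparound logic.
    Returns the number of buffers "written" in this call.
    """
    n = len(buffers)
    if state.remaining <= 0:
        # Previous pass is finished: start a new one at the clock-sweep point.
        state.scan_pos = clock_pos
        state.remaining = int(n * scan_fraction)
    elif state.scan_pos < clock_pos:
        # The clock hand overtook the scan; skip ahead to it.
        state.scan_pos = clock_pos

    written = 0
    while state.remaining > 0 and written < max_writes:
        buf = buffers[state.scan_pos % n]
        if buf.dirty:
            buf.dirty = False   # stand-in for flushing the page to disk
            written += 1
        state.scan_pos += 1
        state.remaining -= 1
    return written
```

Calling this once per cycle with a small max_writes lets a pass over a large scan_fraction spread across several cycles, which is the point of keeping the state static.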

I'm going to leave the all- and lru- bgwriter scans alone for now to get 
this LDC patch finished. We still have the bgwriter autotuning patch in 
the queue. Let's think about this in the context of that patch.

>> As a general comment on this subject, a lot of the work in LDC presumes 
>> you have an accurate notion of how close the next checkpoint is.
> Yeah; this is one reason I was interested in carrying some write-speed
> state across checkpoints instead of having the calculation start from
> scratch each time.  That wouldn't help systems that sit idle a long time
> and suddenly go nuts, but it seems to me that smoothing the write rate
> across more than one checkpoint could help if the fluctuations occur
> over a timescale of a few checkpoints.

Hmm. This problem only applies to checkpoints triggered by 
checkpoint_segments; time tends to move forward at a constant rate.

I'm not worried about small fluctuations or bursts. As I argued earlier, 
the OS still buffers the writes and should give some extra smoothing of 
the physical writes. I believe bursts of say 50-100 MB would easily fit 
in OS cache, as long as there's enough gap between them. I haven't 
tested that, though.

Here's a proposal for an algorithm to smooth bigger bursts:

The basic design is the same as before. We keep track of elapsed time 
and elapsed xlogs, and based on them we estimate how much progress in 
flushing the buffers we should've made by now, and then we catch up 
until we reach that. The estimate for time is the same. The estimate for 
xlogs gets more complicated:

Let's have a few definitions first:

Ro = elapsed segments / elapsed time, from previous checkpoint cycle. 
For example, 1.25 means that the checkpoint was triggered by 
checkpoint_segments, and we had spent 1/1.25 =  80% of 
checkpoint_timeout when we reached checkpoint_segments. 0.25 would mean 
that checkpoint was triggered by checkpoint_timeout, and we had spent 
25% of checkpoint_segments by then.
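
The two readings of Ro can be spelled out in a few lines of Python. This is only an illustration of the arithmetic above; interpret_ratio is a hypothetical helper name, and Ro is assumed to be computed from segments and time expressed as fractions of their respective limits.

```python
def interpret_ratio(ro: float):
    """Return (which limit triggered the checkpoint, fraction of the other limit used)."""
    if ro >= 1.0:
        # Segments ran out first; 1/Ro of checkpoint_timeout had elapsed.
        return ("checkpoint_segments", 1.0 / ro)
    # Time ran out first; Ro of checkpoint_segments had been consumed.
    return ("checkpoint_timeout", ro)
```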

Rn = elapsed segments / elapsed time so far in the current in-progress 
checkpoint cycle.

t = elapsed time, as a fraction of checkpoint_timeout (0.0 - 1.0, though 
could be > 1 if next checkpoint is already due)
s = elapsed xlog segments, as a fraction of checkpoint_segments (0.0 - 
1.0, though could again be > 1 if next checkpoint is already due)

R = estimate for WAL segment consumption rate, as checkpoint_segments / 
checkpoint_timeout

R = Ro * (1 - t) + Rn * t

In other words, at the beginning of the checkpoint, we give more weight 
to the state carried over from previous checkpoint. As we go forward, 
more weight is given to the rate calculated from current cycle.

From R, we extrapolate how much progress we should've made by now:

Target progress = R * t
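
A minimal Python sketch of the estimate, following the description that the previous cycle's rate dominates at the start of the checkpoint and the current cycle's rate takes over as t grows, i.e. R = Ro * (1 - t) + Rn * t. The function names are hypothetical, and two details are assumptions not stated above: the weight is clamped to the range 0..1 so that t > 1 does not produce a negative weight, and Rn falls back to Ro when no time has elapsed yet.

```python
def estimate_rate(ro: float, s: float, t: float) -> float:
    """Blend the previous cycle's rate (ro) with the current cycle's rate.

    s and t are elapsed segments and elapsed time, as fractions of
    checkpoint_segments and checkpoint_timeout respectively.
    """
    w = min(max(t, 0.0), 1.0)       # assumption: clamp the weight to [0, 1]
    rn = s / t if t > 0 else ro     # assumption: no data yet, reuse ro
    return ro * (1.0 - w) + rn * w


def target_progress(ro: float, s: float, t: float) -> float:
    """Fraction of the checkpoint that should have been completed by now."""
    return estimate_rate(ro, s, t) * t
```

For example, with Ro = 1.25, s = 0.1 and t = 0.2, Rn is 0.5, R is 1.25 * 0.8 + 0.5 * 0.2 = 1.1, and the target progress is 0.22.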

This would require saving just one number from the previous cycle (the 
final Rn, which becomes Ro for the next cycle), and there is no 
requirement to call the estimation function at steady time intervals, 
for example. It gives a pretty steady I/O rate even if there are big 
bursts in WAL activity, but still reacts to changes in the rate.

I'm not convinced this is worth the effort, though. First of all, this 
is only a problem if you use checkpoint_segments to control your 
checkpoints, so you can lower checkpoint_timeout to do more work during 
the idle periods. Secondly, with the optimization of not flushing 
buffers during checkpoint that were dirtied after the start of 
checkpoint, the LRU-sweep will also contribute to flushing the buffers 
and finishing the checkpoint. We don't count them towards the progress 
made ATM, but we probably should. Lastly, distributing the writes even a 
little bit is going to be smoother than the current behavior anyway.

   Heikki Linnakangas
