Tom Lane wrote:
> 1. checkpoint_rate is used thusly:
> writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay);
> where writes_per_nap is the max number of dirty blocks to write before
> taking a bgwriter nap. Now surely this is completely backward: if
> BgWriterDelay is increased, the number of writes to allow per nap
> decreases? If you think checkpoint_rate is expressed in some kind of
> physical bytes/sec unit, that cannot be right; the number of blocks
> per nap has to increase if the naps get longer.
Uh, you're right, that's just plain wrong. checkpoint_rate is in
pages/s, so that line should be
writes_per_nap = Min(1, checkpoint_rate * BgWriterDelay / 1000)
> (BTW, the patch seems
> a bit schizoid about whether checkpoint_rate is int or float.)
Yeah, I've gone back and forth on the data type. I wanted it to be a
float, but guc code doesn't let you specify a float in KB, so I switched
it to int.
> 2. checkpoint_smoothing is used thusly:
> /* scale progress according to CheckPointSmoothing */
> progress *= CheckPointSmoothing;
> where the progress value being scaled is the fraction so far completed
> of the total number of dirty pages we expect to have to write. This
> is then compared against estimates of the total fraction of the time-
> between-checkpoints that has elapsed; if less, we are behind schedule
> and should not nap, if more, we are ahead of schedule and may nap.
> (This is a bit odd, but I guess it's all right because it's equivalent
> to dividing the elapsed-time estimate by CheckPointSmoothing, which
> seems a more natural way of thinking about what needs to happen.)
Yeah, it's a bit unnatural. I did it that way so we don't have to divide
all three of the estimates: cached_elapsed, progress_in_time and
progress_in_xlog. Maybe it's not worth micro-optimizing that.
> What's bugging me about this is that we are either going to be writing
> at checkpoint_rate if ahead of schedule, or max possible rate if behind;
> that's not "smoothing" to me, that's introducing some pretty bursty
> behavior. ISTM that actual "smoothing" would involve adjusting
> writes_per_nap up or down according to whether we are ahead or behind
> schedule, so as to have a finer degree of control over the I/O rate.
> (I'd also consider saving the last writes_per_nap value across
> checkpoints so as to have a more nearly accurate starting value next
That sounds a lot more complex. The burstiness at that level shouldn't
matter much. The OS is still going to cache the writes, and should even
out the bursts.
Assuming time/xlogs elapse at a steady rate, we will write some multiple
of writes_per_nap pages between each sleep. With a small writes_per_nap,
writing just writes_per_nap isn't enough to catch up after we fall
behind, so we'll write more than that between each sleep. That means
that on each iteration, we'll write either N*writes_per_nap, or
(N+1)*writes_per_nap. At worst, that means either writes_per_nap or
2*writes_per_nap pages on each iteration. That's not too bad, I think.
> In any case I still concur with Takahiro-san that "smoothing" doesn't
> seem the most appropriate name for the parameter. Something along the
> lines of "checkpoint_completion_target" would convey more about what it
> does, I think.
Ok, I'm not wedded to smoothing.
> And checkpoint_rate really needs to be named checkpoint_min_rate, if
> it's going to be a minimum. However, I question whether we need it at
> all, because as the code stands, with the default BgWriterDelay you
> would have to increase checkpoint_rate to 4x its proposed default before
> writes_per_nap moves off its minimum of 1.
Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s,
using the non-broken formula above, we get:
(512*1024/8192) * 200 / 1000 = 12.8, truncated to 12.
So I think that's fine. (looking at the patch, I see that the default in
guc.c is actually 100 pages / s, contrary to documentation; needs to be
> This says to me that the
> system's tested behavior has been so insensitive to checkpoint_rate
> that we probably need not expose such a parameter at all; just hardwire
> the minimum writes_per_nap at 1.
I've set checkpoint_rate to a small value in my tests on purpose to
control the feature with the other parameter. That must be why I haven't
noticed the bogus calculation of it before.
In response to
pgsql-patches by date
|Next:||From: Magnus Hagander||Date: 2007-06-22 10:57:53|
|Subject: Re: Preliminary GSSAPI Patches|
|Previous:||From: ITAGAKI Takahiro||Date: 2007-06-22 05:31:46|
|Subject: pgstat_drop_relation bugfix|