Skip site navigation (1) Skip section navigation (2)

Re: Load Distributed Checkpoints, take 3

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, take 3
Date: 2007-06-22 08:18:53
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-patches
Tom Lane wrote:
> 1. checkpoint_rate is used thusly:
> 	writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay);
> where writes_per_nap is the max number of dirty blocks to write before
> taking a bgwriter nap.  Now surely this is completely backward: if
> BgWriterDelay is increased, the number of writes to allow per nap
> decreases?  If you think checkpoint_rate is expressed in some kind of
> physical bytes/sec unit, that cannot be right; the number of blocks
> per nap has to increase if the naps get longer.  

Uh, you're right, that's just plain wrong. checkpoint_rate is in 
pages/s, so that line should be

writes_per_nap = Min(1, checkpoint_rate * BgWriterDelay / 1000)

> (BTW, the patch seems
> a bit schizoid about whether checkpoint_rate is int or float.)

Yeah, I've gone back and forth on the data type. I wanted it to be a 
float, but guc code doesn't let you specify a float in KB, so I switched 
it to int.

> 2. checkpoint_smoothing is used thusly:
> 	/* scale progress according to CheckPointSmoothing */
> 	progress *= CheckPointSmoothing;
> where the progress value being scaled is the fraction so far completed
> of the total number of dirty pages we expect to have to write.  This
> is then compared against estimates of the total fraction of the time-
> between-checkpoints that has elapsed; if less, we are behind schedule
> and should not nap, if more, we are ahead of schedule and may nap.

> (This is a bit odd, but I guess it's all right because it's equivalent
> to dividing the elapsed-time estimate by CheckPointSmoothing, which
> seems a more natural way of thinking about what needs to happen.)

Yeah, it's a bit unnatural. I did it that way so we don't have to divide 
all three of the estimates: cached_elapsed, progress_in_time and 
progress_in_xlog. Maybe it's not worth micro-optimizing that.

> What's bugging me about this is that we are either going to be writing
> at checkpoint_rate if ahead of schedule, or max possible rate if behind;
> that's not "smoothing" to me, that's introducing some pretty bursty
> behavior.  ISTM that actual "smoothing" would involve adjusting
> writes_per_nap up or down according to whether we are ahead or behind
> schedule, so as to have a finer degree of control over the I/O rate.
> (I'd also consider saving the last writes_per_nap value across
> checkpoints so as to have a more nearly accurate starting value next
> time.)

That sounds a lot more complex. The burstiness at that level shouldn't 
matter much. The OS is still going to cache the writes, and should even 
out the bursts.

Assuming time/xlogs elapse at a steady rate, we will write some multiple 
of writes_per_nap pages between each sleep. With a small writes_per_nap, 
writing just writes_per_nap isn't enough to catch up after we fall 
behind, so we'll write more than that between each sleep. That means 
that on each iteration, we'll write either N*writes_per_nap, or 
(N+1)*writes_per_nap. At worst, that means either writes_per_nap or 
2*writes_per_nap pages on each iteration. That's not too bad, I think.

> In any case I still concur with Takahiro-san that "smoothing" doesn't
> seem the most appropriate name for the parameter.  Something along the
> lines of "checkpoint_completion_target" would convey more about what it
> does, I think.

Ok, I'm not wedded to smoothing.

> And checkpoint_rate really needs to be named checkpoint_min_rate, if
> it's going to be a minimum.  However, I question whether we need it at
> all, because as the code stands, with the default BgWriterDelay you
> would have to increase checkpoint_rate to 4x its proposed default before
> writes_per_nap moves off its minimum of 1. 

Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, 
  using the non-broken formula above, we get:

(512*1024/8192) * 200 / 1000 = 12.8, truncated to 12.

So I think that's fine. (looking at the patch, I see that the default in 
guc.c is actually 100 pages / s, contrary to documentation; needs to be 

> This says to me that the
> system's tested behavior has been so insensitive to checkpoint_rate
> that we probably need not expose such a parameter at all; just hardwire
> the minimum writes_per_nap at 1.

I've set checkpoint_rate to a small value in my tests on purpose to 
control the feature with the other parameter. That must be why I haven't 
noticed the bogus calculation of it before.

   Heikki Linnakangas

In response to


pgsql-patches by date

Next:From: Magnus HaganderDate: 2007-06-22 10:57:53
Subject: Re: Preliminary GSSAPI Patches
Previous:From: ITAGAKI TakahiroDate: 2007-06-22 05:31:46
Subject: pgstat_drop_relation bugfix

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group