
Re: Load distributed checkpoint V4.1

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Load distributed checkpoint V4.1
Date: 2007-04-25 10:45:22
Message-ID: 462F3142.3020006@enterprisedb.com
Lists: pgsql-hackers pgsql-patches
ITAGAKI Takahiro wrote:
> Here is an updated version of LDC patch (V4.1).
> In this release, checkpoints finish quickly if there are only a few dirty
> pages in the buffer pool, following the suggestion from Heikki. Thanks.

Excellent, thanks! I was just looking at the results from my test runs 
with version 4. I'll kick off some more tests with this version.

> If the last write phase finished more quickly than configured,
> the next nap phase is also shortened at the same rate. For example, if we
> set checkpoint_write_percent = 50% and the write phase actually finished
> in 25% of checkpoint time, the duration of nap time is adjusted to
> checkpoint_nap_percent * 25% / 50%.

You mean checkpoint_nap_percent * 25% * 50%, I presume, where 50% = 
(actual time spent in write phase)/(checkpoint_write_percent)? Sounds 
good to me.
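The scaling rule above can be sketched in a few lines of Python. This is a hypothetical illustration of the scheme being discussed, not code from the patch; the function and parameter names are invented here.

```python
def scaled_nap_time(checkpoint_timeout, nap_percent, write_percent,
                    actual_write_fraction):
    """Scale the nap phase down when the write phase finished early.

    If the write phase used only actual_write_fraction of the
    checkpoint time instead of the configured write_percent, the nap
    is shrunk by the same ratio (never stretched beyond its budget).
    """
    ratio = actual_write_fraction / write_percent
    return checkpoint_timeout * nap_percent * min(ratio, 1.0)

# With checkpoint_write_percent = 50% and a write phase that actually
# took 25% of checkpoint time, a 10% nap budget is halved:
print(scaled_nap_time(300.0, 0.10, 0.50, 0.25))
```

With a 300-second checkpoint interval this yields a 15-second nap instead of the configured 30 seconds.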

> In the sync phase, we cut down the duration if there are only a few files
> to fsync. We assume storage whose throughput is at least
> 10 * bgwriter_all_maxpages (this is arguable). For example, when
> bgwriter_delay = 200ms and bgwriter_all_maxpages = 5, we assume that
> we can use 2MB/s of flush throughput (10 * 5 pages * 8kB / 200ms).
> If there are 200MB of files to fsync, the duration of the sync phase is
> cut down to 100sec, even if that is shorter than
> checkpoint_sync_percent * checkpoint_timeout.

Sounds reasonable. 10 * bgwriter_all_maxpages is indeed quite arbitrary, 
but it should be enough to eliminate ridiculously long waits if there's 
very little work to do. Or we could do the same thing you did with the 
nap phase, scaling down the time allocated for sync phase by the ratio 
of (actual time spent in write phase)/(checkpoint_write_percent). Using 
the same mechanism in nap and sync phases sounds like a good idea.

> I use bgwriter_all_maxpages as something like 'reserved bandwidth of the
> storage for the bgwriter' here. If there is a better name for it, please rename it.

How about checkpoint_aggressiveness? Or checkpoint_throughput? I think 
the correct metric is (k/M)bytes/sec, making it independent of 
bgwriter_delay.

Do we want the same setting to be used for bgwriter_all_maxpages? I 
don't think we have a reason to believe the same value is good for both. 
In fact I think we should just get rid of bgwriter_all_* eventually, but 
as Greg Smith pointed out we need more testing before we can do that :).

There's one more optimization I'd like to have. A checkpoint scans through 
*all* dirty buffers and writes them out. However, some of those dirty 
buffers might have been dirtied *after* the start of the checkpoint, and 
flushing them is a waste of I/O if they get dirtied again before the 
next checkpoint. Even if they don't, it seems better not to force them 
to disk at checkpoint time; a checkpoint is heavy enough without any extra I/O. 
It didn't make much difference without LDC, because we tried to complete 
the writes as soon as possible so there wasn't a big window for that to 
happen, but now that we spread out the writes it makes a lot of sense. I 
wrote a quick & dirty patch to implement that, and at least in my test 
case it does make some difference.
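The idea can be modeled simply: the checkpoint only needs to flush buffers that were already dirty when it began. This is a toy model, not the actual PostgreSQL buffer manager (which would mark such buffers with a flag at checkpoint start rather than storing a per-buffer LSN as done here):

```python
def buffers_to_write(buffers, checkpoint_start_lsn):
    """Return the dirty buffers a checkpoint must flush for coherence.

    Each buffer is a dict with 'dirty' and 'dirtied_lsn' keys
    (hypothetical representation). Buffers dirtied at or after the
    checkpoint's start will be replayed from WAL written after its
    redo point, so flushing them now would be wasted I/O.
    """
    return [b for b in buffers
            if b["dirty"] and b["dirtied_lsn"] < checkpoint_start_lsn]

pool = [
    {"dirty": True,  "dirtied_lsn": 100},  # dirtied before checkpoint
    {"dirty": True,  "dirtied_lsn": 250},  # dirtied after: skip it
    {"dirty": False, "dirtied_lsn": 50},   # clean: nothing to do
]
print(buffers_to_write(pool, 200))
```

Only the first buffer is flushed; the second stays in memory until the next checkpoint, which is exactly the I/O this optimization saves.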

Here's results of some tests I ran with LDC v4.0:

http://community.enterprisedb.com/ldc/

Imola-164 is the baseline run with CVS HEAD, with 
bgwriter_all_maxpages and bgwriter_all_percent set to zero. I've 
disabled think times in the test to make the checkpoint problem more 
severe. Imola-162 is the same test with LDC patch applied. In Imola-163, 
bgwriter_all_maxpages was set to 10. These runs show that the patch 
clearly works; the response times during a checkpoint are much better. 
Imola-163 is even better, which demonstrates that using 
WRITES_PER_ABSORB (1000) in the absence of bgwriter_all_maxpages isn't a 
good idea.

Imola-165 is the same as imola-163, but it has the optimization applied 
I mentioned above. Only those dirty pages are written that are necessary 
for a coherent checkpoint. The results look roughly the same, except 
that imola-165 achieves a slightly higher total TPM, and the pits in the 
TPM graph are slightly shallower.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
