From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Load distributed checkpoint V4.1
Date: 2007-04-25 10:45:22
Message-ID: 462F3142.3020006@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro wrote:
> Here is an updated version of the LDC patch (V4.1).
> In this release, checkpoints finish quickly if there are only a few dirty
> pages in the buffer pool, following the suggestion from Heikki. Thanks.

Excellent, thanks! I was just looking at the results from my test runs
with version 4. I'll kick off some more tests with this version.

> If the last write phase finished more quickly than configured,
> the next nap phase is also shortened at the same rate. For example, if we
> set checkpoint_write_percent = 50% and the write phase actually finished
> in 25% of checkpoint time, the duration of nap time is adjusted to
> checkpoint_nap_percent * 25% / 50%.

You mean checkpoint_nap_percent * (25% / 50%), I presume, i.e. the nap
time scaled by (actual time spent in write phase)/(checkpoint_write_percent)?
Sounds good to me.
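
Just to make sure we're reading the arithmetic the same way, here's a tiny
standalone sketch of that adjustment as I understand it (the function and
parameter names are mine, not the patch's):

#include <stdio.h>

/*
 * Nap time is the configured allocation, scaled down by the ratio of the
 * time the write phase actually took to the time it was allowed to take.
 */
static double
nap_seconds(double checkpoint_timeout,       /* seconds */
            double checkpoint_nap_percent,   /* fraction, 0..1 */
            double checkpoint_write_percent, /* fraction, 0..1 */
            double actual_write_frac)        /* fraction of timeout used */
{
    double  ratio = actual_write_frac / checkpoint_write_percent;

    if (ratio > 1.0)
        ratio = 1.0;            /* never nap longer than configured */

    return checkpoint_timeout * checkpoint_nap_percent * ratio;
}

int
main(void)
{
    /* write_percent = 50%, write phase actually used 25% of the timeout */
    printf("nap = %.0f s\n", nap_seconds(300.0, 0.10, 0.50, 0.25));
    /* prints 15 s: half of the 30 s that 10% of a 300 s timeout allows */
    return 0;
}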

> In the sync phase, we cut down the duration if there are only a few files
> to fsync. We assume that we have storage whose throughput is at least
> 10 * bgwriter_all_maxpages (this is arguable). For example, when
> bgwriter_delay=200ms and bgwriter_all_maxpages=5, we assume that
> we can use 2MB/s of flush throughput (10 * 5 pages * 8kB / 200ms).
> If there are 200MB of files to fsync, the duration of the sync phase is
> cut down to 100 sec even if that is shorter than
> checkpoint_sync_percent * checkpoint_timeout.

Sounds reasonable. 10 * bgwriter_all_maxpages is indeed quite arbitrary,
but it should be enough to eliminate ridiculously long waits if there's
very little work to do. Or we could do the same thing you did with the
nap phase, scaling down the time allocated for sync phase by the ratio
of (actual time spent in write phase)/(checkpoint_write_percent). Using
the same mechanism in nap and sync phases sounds like a good idea.
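
For reference, this is how I read the throughput-based cap, as a standalone
sketch (not the patch's actual code; the names and units are my assumptions):

#include <stdio.h>

#define BLCKSZ 8192

/*
 * Assume storage can absorb at least 10 * bgwriter_all_maxpages pages per
 * bgwriter_delay, and don't let the sync phase run longer than the
 * estimated fsync work needs, nor longer than its configured allocation.
 */
static double
sync_phase_seconds(double bytes_to_fsync,
                   int bgwriter_all_maxpages,
                   double bgwriter_delay_ms,
                   double checkpoint_timeout,
                   double checkpoint_sync_percent)
{
    double  throughput = 10.0 * bgwriter_all_maxpages * BLCKSZ /
                         (bgwriter_delay_ms / 1000.0);  /* bytes/sec */
    double  needed = bytes_to_fsync / throughput;
    double  allocated = checkpoint_timeout * checkpoint_sync_percent;

    return (needed < allocated) ? needed : allocated;
}

int
main(void)
{
    /* 200MB to fsync, maxpages=5, delay=200ms -> ~2MB/s -> ~100 s */
    printf("sync phase = %.0f s\n",
           sync_phase_seconds(200.0 * 1024 * 1024, 5, 200.0, 3600.0, 0.2));
    return 0;
}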

> I use bgwriter_all_maxpages as something like a 'reserved bandwidth of
> storage for the bgwriter' here. If there is a better name for it, please
> rename it.

How about checkpoint_aggressiveness? Or checkpoint_throughput? I think
the correct metric is (k/M)bytes/sec, making it independent of
bgwriter_delay.
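
To illustrate, if the knob were expressed in kB/s, deriving the per-wakeup
page budget from bgwriter_delay would be trivial; a hypothetical sketch
(checkpoint_throughput is a made-up name, not a proposed GUC spelling):

#include <stdio.h>

#define BLCKSZ 8192

/* How many pages a kB/s budget allows per bgwriter_delay wakeup. */
static int
pages_per_wakeup(double checkpoint_throughput_kb, double bgwriter_delay_ms)
{
    double  bytes_per_wakeup = checkpoint_throughput_kb * 1024.0 *
                               (bgwriter_delay_ms / 1000.0);

    return (int) (bytes_per_wakeup / BLCKSZ);
}

int
main(void)
{
    /* 2000 kB/s at a 200 ms delay -> 50 pages per wakeup */
    printf("%d pages per wakeup\n", pages_per_wakeup(2000.0, 200.0));
    return 0;
}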

Do we want the same setting to be used for bgwriter_all_maxpages? I
don't think we have a reason to believe the same value is good for both.
In fact I think we should just get rid of bgwriter_all_* eventually, but
as Greg Smith pointed out we need more testing before we can do that :).

There's one more optimization I'd like to have. A checkpoint scans through
*all* dirty buffers and writes them out. However, some of those dirty
buffers might have been dirtied *after* the start of the checkpoint, and
flushing them is a waste of I/O if they get dirtied again before the
next checkpoint. Even if they don't, it seems better not to force them
to disk at checkpoint time; a checkpoint is heavy enough without any extra I/O.
It didn't make much difference without LDC, because we tried to complete
the writes as soon as possible so there wasn't a big window for that to
happen, but now that we spread out the writes it makes a lot of sense. I
wrote a quick & dirty patch to implement that, and at least in my test
case it does make some difference.
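
To spell out the idea (this is a toy model, not the actual patch): remember
at checkpoint start which buffers are dirty, and have the write phase skip
anything that was dirtied later.

#include <stdio.h>
#include <stdbool.h>

#define NBUFFERS 8

typedef struct
{
    bool    dirty;
    bool    checkpoint_needed;  /* was dirty when the checkpoint began */
} ToyBuffer;

static ToyBuffer buffers[NBUFFERS];

static void
checkpoint_start(void)
{
    int     i;

    for (i = 0; i < NBUFFERS; i++)
        buffers[i].checkpoint_needed = buffers[i].dirty;
}

static void
checkpoint_write_phase(void)
{
    int     i;

    for (i = 0; i < NBUFFERS; i++)
    {
        if (!buffers[i].checkpoint_needed)
            continue;           /* dirtied after the start: leave it alone */
        printf("flushing buffer %d\n", i);
        buffers[i].dirty = false;
        buffers[i].checkpoint_needed = false;
    }
}

int
main(void)
{
    buffers[1].dirty = true;
    buffers[4].dirty = true;
    checkpoint_start();
    buffers[6].dirty = true;    /* dirtied during the checkpoint: skipped */
    checkpoint_write_phase();   /* flushes buffers 1 and 4 only */
    return 0;
}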

Here's results of some tests I ran with LDC v4.0:

http://community.enterprisedb.com/ldc/

Imola-164 is a baseline run with CVS HEAD, with
bgwriter_all_maxpages and bgwriter_all_percent set to zero. I've
disabled think times in the test to make the checkpoint problem more
severe. Imola-162 is the same test with the LDC patch applied. In Imola-163,
bgwriter_all_maxpages was set to 10. These runs show that the patch
clearly works; the response times during a checkpoint are much better.
Imola-163 is even better, which demonstrates that using
WRITES_PER_ABSORB (1000) in the absence of bgwriter_all_maxpages isn't a
good idea.

Imola-165 is the same as imola-163, but with the optimization I mentioned
above applied: only the dirty pages needed for a coherent checkpoint are
written. The results look roughly the same, except
that imola-165 achieves a slightly higher total TPM, and the pits in the
TPM graph are slightly shallower.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
