Re: Load distributed checkpoint V4

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Load distributed checkpoint V4
Date: 2007-04-19 11:02:19
Message-ID: 46274C3B.2020804@iki.fi
Lists: pgsql-hackers, pgsql-patches

ITAGAKI Takahiro wrote:
> Here is an updated version of LDC patch (V4).

Thanks! I'll start testing.

> - Progress of checkpoint is controlled not only based on checkpoint_timeout
>   but also checkpoint_segments. -- Now it works better with large
>   checkpoint_timeout and small checkpoint_segments.

Great, much better now. I like the concept of "progress" used in the 
calculations. We might want to call GetCheckpointProgress something 
else, though. It doesn't return the amount of progress made, but rather 
the amount of progress we should've made up to that point or we're in 
danger of not completing the checkpoint in time.
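
To make the distinction concrete, here is a minimal Python sketch of what such a "required progress" calculation might look like. The function name and all parameters are hypothetical and not taken from the patch; it assumes the target is simply whichever of the two limits (elapsed time or consumed WAL segments) is further along.

```python
def required_progress(elapsed_secs, checkpoint_timeout,
                      segments_used, checkpoint_segments):
    """Fraction (0.0 to 1.0) of the checkpoint that should be done by now.

    Hypothetical sketch: this is the progress we *should* have made,
    not the progress actually made. It follows whichever limit is
    closer to triggering the next checkpoint.
    """
    by_time = elapsed_secs / checkpoint_timeout
    by_segments = segments_used / checkpoint_segments
    return min(1.0, max(by_time, by_segments))
```

With a large checkpoint_timeout and a small checkpoint_segments, the segment term dominates, which is the case the V4 change is meant to handle.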

> We can control the delay of checkpoints using three parameters:
> checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent.
> If we set all of the values to zero, checkpoint behaves as it was.

The nap and sync phases are pretty straightforward. The write phase, 
however, behaves a bit differently.

In the nap phase, we just sleep until enough time/segments have passed, 
where enough is defined by checkpoint_nap_percent. However, if we're 
already past checkpoint_write_percent at the beginning of the nap, I 
think we should clamp the nap time so that we don't run out of time 
until the next checkpoint because of sleeping.
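
A rough Python sketch of the clamp I have in mind follows. The function and its parameters are hypothetical; it assumes an unclamped nap lasts checkpoint_nap_percent from wherever it starts, and that all values are percentages of the checkpoint interval.

```python
def nap_end(progress_now, nap_percent, sync_percent):
    """Progress (percent of the checkpoint interval) at which the nap ends.

    Assumption: an unclamped nap lasts nap_percent from its start.
    The clamp keeps sync_percent of the interval free for the sync phase.
    """
    unclamped = progress_now + nap_percent
    latest_allowed = 100.0 - sync_percent
    # If we are already past the limit, end the nap immediately.
    return max(progress_now, min(unclamped, latest_allowed))
```

For example, with nap_percent = 10 and sync_percent = 20, a nap starting at 75% would end at 80% instead of 85%, and a nap starting at 90% would be skipped.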

In the sync phase, we sleep between each fsync until enough 
time/segments have passed, assuming that the time to fsync is 
proportional to the file length. I'm not sure that's a very good 
assumption. We might have one huge file with only very little changed 
data, for example a logging table that is just occasionally appended to. 
If we begin by fsyncing that, it'll take a very short time to finish, 
and we'll then sleep for a long time. If we then have another large file 
to fsync, but that one has all pages dirty, we risk running out of time 
because of the unnecessarily long sleep. The segmentation of relations 
limits the risk of that, though, by limiting the max. file size, and I 
don't really have any better suggestions.
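
To illustrate the assumption being questioned, this Python sketch splits a sleep budget across files in proportion to their length. It is a simplified model with hypothetical names, not the patch's code.

```python
def sync_sleep_budget(file_sizes, total_budget_secs):
    """Split the sync-phase sleep budget across files by file length.

    Simplified model of the assumption discussed above: a file's share
    of the budget depends only on its size, not on how many of its
    pages are actually dirty.
    """
    total = sum(file_sizes)
    if total == 0:
        return [0.0 for _ in file_sizes]
    return [total_budget_secs * size / total for size in file_sizes]
```

Two files of equal length get equal shares here even if one has a single dirty page and the other is dirty throughout, which is exactly the risk described above.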

In the write phase, bgwriter_all_maxpages is also factored in the 
sleeps. On each iteration, we write bgwriter_all_maxpages pages and then 
we sleep for bgwriter_delay msecs. checkpoint_write_percent only 
controls the maximum amount of time we try to spend in the write phase: 
we skip the sleeps if we're exceeding checkpoint_write_percent, but the 
phase can very well finish earlier. IOW, bgwriter_all_maxpages is the *minimum* 
amount of pages to write between sleeps. If it's not set, we use 
WRITERS_PER_ABSORB, which is hardcoded to 1000.
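
The write-phase behaviour described above can be sketched roughly like this in Python. The function, its parameters and the callbacks are all hypothetical stand-ins for illustration; this is my reading of the loop, not the patch itself.

```python
def write_phase(dirty_pages, min_pages_per_round, behind_schedule,
                write_page, sleep):
    """Write pages in rounds of at least min_pages_per_round.

    behind_schedule() stands in for the checkpoint_write_percent check:
    when it returns True we skip the sleep to catch up. Returns the
    number of rounds performed.
    """
    rounds = 0
    for start in range(0, len(dirty_pages), min_pages_per_round):
        for page in dirty_pages[start:start + min_pages_per_round]:
            write_page(page)
        rounds += 1
        if not behind_schedule():
            sleep()  # corresponds to sleeping for bgwriter_delay msecs
    return rounds
```

Note that the number of rounds depends only on the amount of dirty data, so with little work to do the phase ends long before the time budget is used up.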

The approach of writing min. N pages per iteration seems sound to me. By 
setting N we can control the maximum impact of a checkpoint under normal 
circumstances. If there's very little work to do, it doesn't make sense 
to stretch the write of say 10 buffers across a 15 min period; it's 
indeed better to finish the checkpoint earlier. It's similar to 
vacuum_cost_limit in that sense. But using bgwriter_all_maxpages for it 
doesn't feel right; we should at least name it differently. The default 
of 1000 is a bit high as well; with the default bgwriter_delay, that adds 
up to 39MB/s. That's ok for a decent I/O subsystem, but the default 
really should be something that will still leave room for other I/O on a 
small single-disk server.
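
For reference, the 39MB/s figure comes from the following arithmetic, shown here as a small Python sketch. It assumes the default 8192-byte page size and the default bgwriter_delay of 200 ms, and it ignores the time spent in the writes themselves, so it is an upper bound.

```python
BLCKSZ = 8192            # default page size in bytes (assumed)
PAGES_PER_ROUND = 1000   # the hardcoded fallback mentioned above
BGWRITER_DELAY_MS = 200  # default bgwriter_delay (assumed)

def write_rate_mib_per_sec(pages, delay_ms, page_size=BLCKSZ):
    """Upper bound on write rate: pages per round times rounds per second."""
    rounds_per_sec = 1000 / delay_ms
    return pages * page_size * rounds_per_sec / (1024 * 1024)

rate = write_rate_mib_per_sec(PAGES_PER_ROUND, BGWRITER_DELAY_MS)
```

That is 1000 pages x 8 kB x 5 rounds per second, or about 39 MB/s.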

Should we try doing something similar for the sync phase? If there's 
only 2 small files to fsync, there's no point sleeping for 5 minutes 
between them just to use up the checkpoint_sync_percent budget.

Should we give a warning if you set the *_percent settings so that they 
exceed 100%?
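
Such a check could be as simple as the Python sketch below. The function name and message wording are hypothetical, and it assumes "exceed 100%" refers to the sum of the three settings.

```python
def check_percent_settings(write_percent, nap_percent, sync_percent):
    """Return a warning string if the three settings sum to over 100%.

    Hypothetical sketch; returns None when the settings are fine.
    """
    total = write_percent + nap_percent + sync_percent
    if total > 100:
        return ("WARNING: checkpoint_write_percent, checkpoint_nap_percent "
                f"and checkpoint_sync_percent add up to {total}%, "
                "which exceeds 100%")
    return None
```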

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
