
Re: Load distributed checkpoint

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: Load distributed checkpoint
Date: 2006-12-22 18:53:13
Message-ID: 200612221853.kBMIrDF04797@momjian.us
Lists: pgsql-hackers, pgsql-patches

I have a new idea.  Rather than increasing write activity as we approach
checkpoint, I think there is an easier solution.  I am very familiar
with the BSD kernel, and it seems they have a similar issue in trying to
smooth writes:
	
	http://www.brno.cas.cz/cgi-bin/bsdi-man?proto=1.1&query=update&msection=4&apropos=0
	
	UPDATE(4)                   BSD Programmer's Manual                  UPDATE(4)
	
	NAME
	     update - trickle sync filesystem caches to disk
	
	DESCRIPTION
	     At system boot time, the kernel starts filesys_syncer, process
	     3.  This process helps protect the integrity of disk volumes
	     by ensuring that volatile cached filesystem data are written
	     to disk within the vfs.generic.syncdelay interval which defaults
	     to thirty seconds (see sysctl(8)).  When a vnode is first
	     written it is placed vfs.generic.syncdelay seconds down on
	     the trickle sync queue.  If it still exists and has dirty data
	     when it reaches the top of the queue, filesys_syncer writes
	     it to disk.  This approach evens out the load on the underlying
	     I/O system and avoids writing short-lived files.  The papers
	     on trickle-sync tend to favor aging based on buffers rather
	     than files.  However, BSD/OS synchronizes on file age rather
	     than buffer age because the data structures are much smaller
	     as there are typically far fewer files than buffers.  Although
	     this can make the I/O bursty when a big file is written to
	     disk, it is still much better than the wholesale writes that
	     were being done by the historic update process which wrote
	     all dirty data buffers every 30 seconds.  It also adapts much
	     better to the soft update code which wants to control aging
	     to improve performance (inodes age in one third of
	     vfs.generic.syncdelay seconds, directories in one half of
	     vfs.generic.syncdelay seconds).  This ordering ensures that
	     most dependencies are gone (e.g., inodes are written when
	     directory entries want to go to disk) reducing the amount
	     of work that the soft update code needs to do.

I assume other kernels have similar I/O smoothing, so that data sent to
the kernel via write() gets to disk within 30 seconds.  
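
As a rough illustration of the queue the man page describes, here is a toy
Python model.  It is only a sketch: the class name TrickleSyncQueue, its
methods, and the one-slot-per-second layout are assumptions made for this
example and are not taken from the BSD source.

```python
from collections import deque


class TrickleSyncQueue:
    """Toy model of a trickle-sync queue (hypothetical, not the BSD code)."""

    def __init__(self, syncdelay=30):
        if syncdelay < 1:
            raise ValueError("syncdelay must be at least 1 second")
        # One slot per second; slots[0] is the top of the queue.
        self.slots = deque(set() for _ in range(syncdelay))
        self.dirty = set()    # files that currently hold dirty data
        self.queued = set()   # files already waiting somewhere in the queue

    def write(self, name):
        """Dirty a file; only its first write places it on the queue."""
        self.dirty.add(name)
        if name not in self.queued:
            self.queued.add(name)
            self.slots[-1].add(name)  # syncdelay seconds down the queue

    def remove(self, name):
        """A deleted (short-lived) file no longer has data worth writing."""
        self.dirty.discard(name)

    def tick(self):
        """Advance one second and return the files flushed in that second."""
        due = self.slots.popleft()
        self.slots.append(set())
        # Only files that still exist and are still dirty get written.
        flushed = sorted(name for name in due if name in self.dirty)
        self.dirty.difference_update(flushed)
        self.queued.difference_update(due)
        return flushed
```

With syncdelay=3, a file written now is flushed on the third tick, and a file
that is written and then removed before its turn never causes any I/O.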

I assume our checkpoint performance problem is not write() itself, but the
transfer to disk triggered by fsync().  Perhaps a simple solution is to do the
write()'s of all dirty buffers as we do now at checkpoint time, but
delay 30 seconds and then do fsync() on all the files.  The goal here is
that during the 30-second delay, the kernel will be forcing data to the
disk, so the fsync() we eventually do will only be for the write() of
buffers during the 30-second delay, and because we wrote all dirty
buffers 30 seconds ago, there shouldn't be too many of them.
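
The three phases proposed above (write everything, wait, then fsync) can be
sketched in Python as follows.  This is an illustration only, not PostgreSQL
code: the function name checkpoint, the (path, offset, data) tuple layout, and
the injectable sleep parameter are all assumptions made for this example.

```python
import os
import time


def checkpoint(dirty_buffers, sync_delay=30.0, sleep=time.sleep):
    """Write all dirty buffers, pause, then fsync each touched file.

    dirty_buffers is an iterable of (path, offset, data) tuples; this
    layout is hypothetical and chosen only for the sketch.
    Returns the list of files that were fsync'ed, in first-touched order.
    """
    touched = []

    # Phase 1: hand every dirty buffer to the kernel with write().
    for path, offset, data in dirty_buffers:
        # O_CREAT is here only so the sketch works on fresh files.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.lseek(fd, offset, os.SEEK_SET)
            view = memoryview(data)
            while view:                      # cope with short writes
                written = os.write(fd, view)
                view = view[written:]
        finally:
            os.close(fd)
        if path not in touched:
            touched.append(path)

    # Phase 2: do nothing, and let the kernel trickle the data to disk.
    sleep(sync_delay)

    # Phase 3: fsync() each file; ideally little is left to flush.
    for path in touched:
        fd = os.open(path, os.O_WRONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    return touched
```

Passing sleep as a parameter is just a convenience so the delay can be
replaced in a test; the default is a real 30-second wait between the phases.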

I think the basic difference between this and the proposed patch is that
we do not put delays in the buffer write() or fsync() phases --- we just
put a delay _between_ the phases, and wait for the kernel to smooth it
out for us.  The kernel certainly knows more about what needs to get to
disk, so it seems logical to let it do the I/O smoothing.

---------------------------------------------------------------------------

Bruce Momjian wrote:
> 
> I have thought a while about this and I have some ideas.
> 
> Ideally, we would be able to trickle the sync of individual blocks
> during the checkpoint, but we can't because we rely on the kernel to
> sync all dirty blocks that haven't made it to disk using fsync().  We
> could trickle the fsync() calls, but that just extends the amount of
> data we are writing that has been dirtied post-checkpoint.  In an ideal
> world, we would be able to fsync() only part of a file at a time, and
> only those blocks that were dirtied pre-checkpoint, but I don't see that
> happening anytime soon (and one reason why many commercial databases
> bypass the kernel cache).
> 
> So, in the real world, one conclusion seems to be that our existing
> method of tuning the background writer just isn't good enough for the
> average user:
> 
> 	#bgwriter_delay = 200ms                 # 10-10000ms between rounds
> 	#bgwriter_lru_percent = 1.0             # 0-100% of LRU buffers scanned/round
> 	#bgwriter_lru_maxpages = 5              # 0-1000 buffers max written/round
> 	#bgwriter_all_percent = 0.333           # 0-100% of all buffers scanned/round
> 	#bgwriter_all_maxpages = 5              # 0-1000 buffers max written/round
> 
> These settings control what the bgwriter does, but they do not clearly
> relate to the checkpoint timing, which is the purpose of the bgwriter,
> and they don't change during the checkpoint interval, which is also less
> than ideal.  If set too aggressively, it writes too much, and if too low,
> the checkpoint does too much I/O.
> 
> We clearly need more bgwriter activity as the checkpoint approaches, and
> one that is more auto-tuned, like many of our other parameters.  I think
> we created these settings to see how they worked in the field, so it is
> probably time to reevaluate them based on field reports.
> 
> I think the bgwriter should keep track of how far it is to the next
> checkpoint, and use that information to increase write activity. 
> Basically now, during a checkpoint, the bgwriter does a full buffer scan
> and fsync's all dirty files, so it changes from the configuration
> parameter-defined behavior right to 100% activity.  I think it would be
> ideal if we could ramp up the writes so that when it is 95% to the next
> checkpoint, it can be operating at 95% of the activity it would do
> during a checkpoint.
> 
> My guess is if we can do that, we will have much smoother performance
> because we have more WAL writes just after checkpoint for newly-dirtied
> pages, and the new setup will give us more write activity just before
> checkpoint.
> 
> One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid
> the kernel cache, so we are sure data will be on disk by checkpoint
> time.  This was avoided in the past because of the expense of
> second-guessing the kernel disk I/O scheduling algorithms.
> 
> ---------------------------------------------------------------------------
> 
> Tom Lane wrote:
> > "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> > > "Jim C. Nasby" <jim(at)nasby(dot)net> wrote: 
> > >> Generally, I try and configure the all* settings so that you'll get 1
> > >> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
> > >> have any actual tests to back that methodology up.
> > 
> > > We got to these numbers somewhat scientifically.  I studied I/O
> > > patterns under production load and figured we should be able to handle
> > > about 800 writes per 200 ms without causing problems.  I have to
> > > admit that I based the percentages and the ratio between "all" and "lru"
> > > on gut feel after musing over the documentation.
> > 
> > I like Kevin's settings better than what Jim suggests.  If the bgwriter
> > only makes one sweep between checkpoints then it's hardly going to make
> > any impact at all on the number of dirty buffers the checkpoint will
> > have to write.  The point of the bgwriter is to reduce the checkpoint
> > I/O spike by doing writes between checkpoints, and to have any
> > meaningful impact on that, you'll need it to make the cycle several times.
> > 
> > Another point here is that you want checkpoints to be pretty far apart
> > to minimize the WAL load from full-page images.  So again, a bgwriter
> > that's only making one loop per checkpoint is not gonna be doing much.
> > 
> > I wonder whether it would be feasible to teach the bgwriter to get more
> > aggressive as the time for the next checkpoint approaches?  Writes
> > issued early in the interval have a much higher probability of being
> > wasted (because the page gets re-dirtied later).  But maybe that just
> > reduces to what Takahiro-san already suggested, namely that
> > checkpoint-time writes should be done with the same kind of scheduling
> > the bgwriter uses outside checkpoints.  We still have the problem that
> > the real I/O storm is triggered by fsync() not write(), and we don't
> > have a way to spread out the consequences of fsync().
> > 
> > 			regards, tom lane
> > 
> > ---------------------------(end of broadcast)---------------------------
> > TIP 9: In versions below 8.0, the planner will ignore your desire to
> >        choose an index scan if your joining column's datatypes do not
> >        match
> 
> -- 
>   Bruce Momjian   bruce(at)momjian(dot)us
>   EnterpriseDB    http://www.enterprisedb.com
> 
>   + If your life is a hard drive, Christ can be your backup. +
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
> 
>                http://archives.postgresql.org

-- 
  Bruce Momjian   bruce(at)momjian(dot)us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
