Re: Load distributed checkpoint

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: Load distributed checkpoint
Date: 2006-12-22 18:53:13
Message-ID: 200612221853.kBMIrDF04797@momjian.us
Lists: pgsql-hackers pgsql-patches


I have a new idea. Rather than increasing write activity as we approach
checkpoint, I think there is an easier solution. I am very familiar
with the BSD kernel, and it seems they have a similar issue in trying to
smooth writes:

http://www.brno.cas.cz/cgi-bin/bsdi-man?proto=1.1&query=update&msection=4&apropos=0

UPDATE(4) BSD Programmer's Manual UPDATE(4)

NAME
update - trickle sync filesystem caches to disk

DESCRIPTION
At system boot time, the kernel starts filesys_syncer, process
3. This process helps protect the integrity of disk volumes
by ensuring that volatile cached filesystem data are written
to disk within the vfs.generic.syncdelay interval which defaults
to thirty seconds (see sysctl(8)). When a vnode is first
written it is placed vfs.generic.syncdelay seconds down on
the trickle sync queue. If it still exists and has dirty data
when it reaches the top of the queue, filesys_syncer writes
it to disk. This approach evens out the load on the underlying
I/O system and avoids writing short-lived files. The papers
on trickle-sync tend to favor aging based on buffers rather
than files. However, BSD/OS synchronizes on file age rather
than buffer age because the data structures are much smaller
as there are typically far fewer files than buffers. Although
this can make the I/O bursty when a big file is written to
disk, it is still much better than the wholesale writes that
were being done by the historic update process which wrote
all dirty data buffers every 30 seconds. It also adapts much
better to the soft update code which wants to control aging
to improve performance (inodes age in one third of
vfs.generic.syncdelay seconds, directories in one half of
vfs.generic.syncdelay seconds). This ordering ensures that
most dependencies are gone (e.g., inodes are written when
directory entries want to go to disk) reducing the amount
of work that the soft update code needs to do.
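The trickle-sync queue the man page describes can be modeled in a few
lines. This is a toy Python model (not the BSD kernel code): each file
enters the queue the first time it is written and is flushed syncdelay
seconds later, so re-dirtying a file does not reset its position.

```python
import time
from collections import deque

class TrickleSyncQueue:
    """Toy model of BSD-style trickle sync: a file is flushed
    syncdelay seconds after it is *first* dirtied, spreading I/O
    over time instead of flushing everything at once."""

    def __init__(self, syncdelay=30.0):
        self.syncdelay = syncdelay
        self.queue = deque()   # (due_time, path), in arrival order
        self.queued = set()

    def mark_dirty(self, path, now=None):
        now = time.monotonic() if now is None else now
        # Only queued on the first write; later writes don't re-queue,
        # so short-lived files that disappear are never flushed.
        if path not in self.queued:
            self.queued.add(path)
            self.queue.append((now + self.syncdelay, path))

    def flush_due(self, fsync, now=None):
        # Assumes mark_dirty() is called with nondecreasing 'now',
        # so the head of the queue is always the earliest due entry.
        now = time.monotonic() if now is None else now
        while self.queue and self.queue[0][0] <= now:
            _, path = self.queue.popleft()
            self.queued.discard(path)
            fsync(path)
```

Passing explicit `now` values makes the aging behavior easy to see
without real sleeps.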

I assume other kernels have similar I/O smoothing, so that data sent to
the kernel via write() gets to disk within 30 seconds.

I assume write() is not our checkpoint performance problem, but the
transfer to disk via fsync(). Perhaps a simple solution is to do the
write()'s of all dirty buffers as we do now at checkpoint time, but
delay 30 seconds and then do fsync() on all the files. The goal here is
that during the 30-second delay, the kernel will be forcing data to the
disk, so the fsync() we eventually do will only be for the write() of
buffers during the 30-second delay, and because we wrote all dirty
buffers 30 seconds ago, there shouldn't be too many of them.
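The two-phase checkpoint described above can be sketched as follows.
This is an illustrative Python sketch, not PostgreSQL's actual
checkpoint code; the file layout (one path mapped to a list of page
byte strings) is a made-up stand-in for relation segments.

```python
import os
import time

def checkpoint(dirty, delay=30.0):
    """Write all dirty buffers, wait for the kernel to trickle them
    to disk, then fsync() each file to guarantee durability."""
    # Phase 1: hand every dirty buffer to the kernel with write().
    fds = {}
    for path, pages in dirty.items():
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        for page in pages:
            os.write(fd, page)
        fds[path] = fd
    # Phase 2: during this delay the kernel's own syncer (e.g. BSD's
    # 30-second trickle sync) should push most of the data to disk.
    time.sleep(delay)
    # Phase 3: fsync() each file. Ideally only pages written during
    # the delay remain unflushed, so the I/O spike is small.
    for fd in fds.values():
        os.fsync(fd)
        os.close(fd)
```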

I think the basic difference between this and the proposed patch is that
we do not put delays in the buffer write() or fsync() phases --- we just
put a delay _between_ the phases, and wait for the kernel to smooth it
out for us. The kernel certainly knows more about what needs to get to
disk, so it seems logical to let it do the I/O smoothing.
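The structural difference comes down to where the delay lives. A
minimal side-by-side sketch (hypothetical callbacks, not either
actual implementation):

```python
import time

def checkpoint_inline_delays(buffers, write, fsync, pause=0.2):
    # Patch-style smoothing: sleep between individual writes,
    # stretching the write phase itself.
    for buf in buffers:
        write(buf)
        time.sleep(pause)
    fsync()

def checkpoint_phase_gap(buffers, write, fsync, gap=30.0):
    # This email's idea: full-speed writes, one delay *between*
    # the phases, letting the kernel smooth I/O during the gap.
    for buf in buffers:
        write(buf)
    time.sleep(gap)
    fsync()
```

Both end in the same fsync(); only the first interleaves delays with
the writes, while the second leaves smoothing to the kernel.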

---------------------------------------------------------------------------

Bruce Momjian wrote:
>
> I have thought a while about this and I have some ideas.
>
> Ideally, we would be able to trickle the sync of individuals blocks
> during the checkpoint, but we can't because we rely on the kernel to
> sync all dirty blocks that haven't made it to disk using fsync(). We
> could trickle the fsync() calls, but that just extends the amount of
> data we are writing that has been dirtied post-checkpoint. In an ideal
> world, we would be able to fsync() only part of a file at a time, and
> only those blocks that were dirtied pre-checkpoint, but I don't see that
> happening anytime soon (and one reason why many commercial databases
> bypass the kernel cache).
>
> So, in the real world, one conclusion seems to be that our existing
> method of tuning the background writer just isn't good enough for the
> average user:
>
> #bgwriter_delay = 200ms # 10-10000ms between rounds
> #bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round
> #bgwriter_lru_maxpages = 5 # 0-1000 buffers max written/round
> #bgwriter_all_percent = 0.333 # 0-100% of all buffers scanned/round
> #bgwriter_all_maxpages = 5 # 0-1000 buffers max written/round
>
> These settings control what the bgwriter does, but they do not clearly
> relate to the checkpoint timing, which is the purpose of the bgwriter,
> and they don't change during the checkpoint interval, which is also less
> than ideal. If set too aggressively, it writes too much, and if too low,
> the checkpoint does too much I/O.
>
> We clearly need more bgwriter activity as the checkpoint approaches, and
> one that is more auto-tuned, like many of our other parameters. I think
> we created these settings to see how they worked in the field, so it's
> probably time to reevaluate them based on field reports.
>
> I think the bgwriter should keep track of how far it is to the next
> checkpoint, and use that information to increase write activity.
> Basically now, during a checkpoint, the bgwriter does a full buffer scan
> and fsync's all dirty files, so it changes from the configuration
> parameter-defined behavior right to 100% activity. I think it would be
> ideal if we could ramp up the writes so that when it is 95% to the next
> checkpoint, it can be operating at 95% of the activity it would do
> during a checkpoint.
>
> My guess is if we can do that, we will have much smoother performance
> because we have more WAL writes just after checkpoint for newly-dirtied
> pages, and the new setup will give us more write activity just before
> checkpoint.
>
> One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid
> the kernel cache, so we are sure data will be on disk by checkpoint
> time. This was avoided in the past because of the expense of
> second-guessing the kernel disk I/O scheduling algorithms.
>
> ---------------------------------------------------------------------------
>
> Tom Lane wrote:
> > "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> > > "Jim C. Nasby" <jim(at)nasby(dot)net> wrote:
> > >> Generally, I try and configure the all* settings so that you'll get 1
> > >> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
> > >> have any actual tests to back that methodology up.
> >
> > > We got to these numbers somewhat scientifically. I studied I/O
> > > patterns under production load and figured we should be able to handle
> > > about 800 writes per 200 ms without causing problems. I have to
> > > admit that I based the percentages and the ratio between "all" and "lru"
> > > on gut feel after musing over the documentation.
> >
> > I like Kevin's settings better than what Jim suggests. If the bgwriter
> > only makes one sweep between checkpoints then it's hardly going to make
> > any impact at all on the number of dirty buffers the checkpoint will
> > have to write. The point of the bgwriter is to reduce the checkpoint
> > I/O spike by doing writes between checkpoints, and to have any
> > meaningful impact on that, you'll need it to make the cycle several times.
> >
> > Another point here is that you want checkpoints to be pretty far apart
> > to minimize the WAL load from full-page images. So again, a bgwriter
> > that's only making one loop per checkpoint is not gonna be doing much.
> >
> > I wonder whether it would be feasible to teach the bgwriter to get more
> > aggressive as the time for the next checkpoint approaches? Writes
> > issued early in the interval have a much higher probability of being
> > wasted (because the page gets re-dirtied later). But maybe that just
> > reduces to what Takahiro-san already suggested, namely that
> > checkpoint-time writes should be done with the same kind of scheduling
> > the bgwriter uses outside checkpoints. We still have the problem that
> > the real I/O storm is triggered by fsync() not write(), and we don't
> > have a way to spread out the consequences of fsync().
> >
> > regards, tom lane
> >

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
