Re: Load distributed checkpoint

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Load distributed checkpoint
Date: 2006-12-12 17:15:06
Message-ID: 200612121715.kBCHF6d24465@momjian.us
Lists: pgsql-hackers pgsql-patches


I have thought a while about this and I have some ideas.

Ideally, we would be able to trickle the sync of individuals blocks
during the checkpoint, but we can't because we rely on the kernel to
sync all dirty blocks that haven't made it to disk using fsync(). We
could trickle the fsync() calls, but that just extends the amount of
data we are writing that has been dirtied post-checkpoint. In an ideal
world, we would be able to fsync() only part of a file at a time, and
only those blocks that were dirtied pre-checkpoint, but I don't see that
happening anytime soon (and one reason why many commercial databases
bypass the kernel cache).
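To make the granularity problem concrete, here is a minimal sketch (hypothetical demonstration code, not anything from the backend): fsync() operates on the whole file descriptor, so it flushes every dirty page of the file, including pages dirtied after the checkpoint began.

```c
/* Hypothetical demo: fsync() flushes the entire file; there is no
 * portable way to sync only the blocks dirtied before the checkpoint. */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int fsync_whole_file(void)
{
    char path[] = "/tmp/ckpt_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    char block[8192];

    /* Dirtied before the checkpoint: this is what we need on disk. */
    memset(block, 'A', sizeof(block));
    if (write(fd, block, sizeof(block)) != sizeof(block))
        return -1;

    /* Dirtied after the checkpoint began: not needed yet, but fsync()
     * will push it out anyway, extending the checkpoint's I/O. */
    memset(block, 'B', sizeof(block));
    if (write(fd, block, sizeof(block)) != sizeof(block))
        return -1;

    /* One call, whole file: both blocks go to disk here. */
    int rc = fsync(fd);
    close(fd);
    unlink(path);
    return rc;
}
```

(On Linux, sync_file_range() can initiate writeback for a byte range, but it is non-portable and does not flush file metadata, so it is not a durability guarantee on its own.)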

So, in the real world, one conclusion seems to be that our existing
method of tuning the background writer just isn't good enough for the
average user:

#bgwriter_delay = 200ms # 10-10000ms between rounds
#bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5 # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333 # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5 # 0-1000 buffers max written/round

These settings control what the bgwriter does, but they do not clearly
relate to checkpoint timing, which is the whole purpose of the bgwriter,
and they don't change during the checkpoint interval, which is also less
than ideal. If set too aggressively, the bgwriter writes too much; if
set too low, the checkpoint does too much I/O.

We clearly need more bgwriter activity as the checkpoint approaches, and
behavior that is more auto-tuned, like many of our other parameters. I
think we created these settings to see how they worked in the field, so
it is probably time to reevaluate them based on field reports.

I think the bgwriter should keep track of how far it is to the next
checkpoint, and use that information to increase write activity.
Basically now, during a checkpoint, the bgwriter does a full buffer scan
and fsync's all dirty files, so it changes from the configuration
parameter-defined behavior right to 100% activity. I think it would be
ideal if we could ramp up the writes so that when it is 95% to the next
checkpoint, it can be operating at 95% of the activity it would do
during a checkpoint.
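One way to express that ramp-up is sketched below (the function and parameter names are illustrative, not actual GUCs or backend code): compute the elapsed fraction of the checkpoint interval and scale the per-round page limit linearly between the normal rate and the full checkpoint-level rate.

```c
#include <assert.h>

/* Sketch: scale the bgwriter's per-round write quota with checkpoint
 * progress. "progress" is the elapsed fraction of the checkpoint
 * interval (0.0 - 1.0), derived from checkpoint_timeout and/or WAL
 * segments consumed. Names here are illustrative only. */
int bgwriter_pages_this_round(double progress,
                              int base_maxpages,       /* normal rate, e.g. 5 */
                              int checkpoint_maxpages) /* full-tilt rate */
{
    if (progress < 0.0)
        progress = 0.0;
    if (progress > 1.0)
        progress = 1.0;

    /* Linear ramp: at 95% of the way to the next checkpoint, operate
     * at roughly 95% of checkpoint-level write activity. */
    return base_maxpages +
           (int) (progress * (checkpoint_maxpages - base_maxpages));
}
```

A linear ramp is the simplest choice; a convex curve would keep early writes (the ones most likely to be wasted on re-dirtied pages) lower still.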

My guess is that if we can do that, we will have much smoother
performance, because we already have more WAL writes just after a
checkpoint for newly-dirtied pages, and the new setup will give us more
write activity just before the checkpoint.

One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid
the kernel cache, so we are sure data will be on disk by checkpoint
time. This was avoided in the past because of the expense of
second-guessing the kernel disk I/O scheduling algorithms.
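For reference, the O_SYNC variant would look roughly like this (a sketch of the idea, not the bgwriter's actual code path): opening the file with O_SYNC makes each write() synchronous, trading kernel write-combining for the guarantee that the data is on stable storage when the call returns.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: synchronous writes via O_SYNC, so each write() reaches stable
 * storage before returning -- no separate fsync() needed at checkpoint
 * time. (O_DIRECT additionally bypasses the kernel cache, but imposes
 * alignment requirements on the buffer and transfer size.) */
int write_block_sync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);   /* durable when this returns */
    int rc = (n == (ssize_t) len) ? 0 : -1;
    close(fd);
    return rc;
}
```

The cost is exactly the second-guessing mentioned above: every write becomes a synchronous disk operation, so the kernel can no longer batch and reorder them.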

---------------------------------------------------------------------------

Tom Lane wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> > "Jim C. Nasby" <jim(at)nasby(dot)net> wrote:
> >> Generally, I try and configure the all* settings so that you'll get 1
> >> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
> >> have any actual tests to back that methodology up.
>
> > We got to these numbers somewhat scientifically. I studied I/O
> > patterns under production load and figured we should be able to handle
> > about 800 writes in per 200 ms without causing problems. I have to
> > admit that I based the percentages and the ratio between "all" and "lru"
> > on gut feel after musing over the documentation.
>
> I like Kevin's settings better than what Jim suggests. If the bgwriter
> only makes one sweep between checkpoints then it's hardly going to make
> any impact at all on the number of dirty buffers the checkpoint will
> have to write. The point of the bgwriter is to reduce the checkpoint
> I/O spike by doing writes between checkpoints, and to have any
> meaningful impact on that, you'll need it to make the cycle several times.
>
> Another point here is that you want checkpoints to be pretty far apart
> to minimize the WAL load from full-page images. So again, a bgwriter
> that's only making one loop per checkpoint is not gonna be doing much.
>
> I wonder whether it would be feasible to teach the bgwriter to get more
> aggressive as the time for the next checkpoint approaches? Writes
> issued early in the interval have a much higher probability of being
> wasted (because the page gets re-dirtied later). But maybe that just
> reduces to what Takahiro-san already suggested, namely that
> checkpoint-time writes should be done with the same kind of scheduling
> the bgwriter uses outside checkpoints. We still have the problem that
> the real I/O storm is triggered by fsync() not write(), and we don't
> have a way to spread out the consequences of fsync().
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
