Re: Load distributed checkpoint

From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: Load distributed checkpoint
Date: 2006-12-28 11:35:52
Message-ID: 20061228113551.GP71246@nasby.net
Lists: pgsql-hackers pgsql-patches

On Wed, Dec 27, 2006 at 10:54:57PM +0000, Simon Riggs wrote:
> On Wed, 2006-12-27 at 23:26 +0100, Martijn van Oosterhout wrote:
> > On Wed, Dec 27, 2006 at 09:24:06PM +0000, Simon Riggs wrote:
> > > On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:
> > >
> > > > I assume other kernels have similar I/O smoothing, so that data sent to
> > > > the kernel via write() gets to disk within 30 seconds.
> > > >
> > > > I assume write() is not our checkpoint performance problem, but the
> > > > transfer to disk via fsync().
> > >
> > > Well, it's correct to say that the transfer to disk is the source of the
> > > problem, but that doesn't only occur when we fsync(). There are actually
> > > two disk storms that occur, because of the way the fs cache works. [Ron
> > > referred to this effect uplist]
> >
> > As someone looking from the outside:
> >
> > fsync only works on one file, so presumably the checkpoint process is
> > opening each file one by one and fsyncing them.
>
> Yes
>
> > Does that make any
> > difference here? Could you adjust the timing here?
>
> That's the hard bit with I/O storm 2. When you fsync a file you don't
> actually know how many blocks you're writing, plus there's no way to
> slow down those writes by putting delays between them (although it's
> possible your controller might know how to do this, I've never heard of
> one that does).

Any controller that sophisticated would likely also have a BBU and write
caching, which should greatly reduce the impact of at least the fsync
storm... unless you fill the cache. I suspect we might need a way to
control how much data we try to push out at a time to avoid that...
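
To make "control how much data we push out at a time" concrete, here is a
minimal sketch, not a patch and not how the backend does it today. It
assumes we already hold a list of (offset, data) dirty pages for one file;
the function name flush_in_batches and the parameters batch_pages and
pause_s are hypothetical. The idea is to fsync after every small batch so
that no single fsync has more than batch_pages pages to force out, with an
optional pause between batches.

```python
import os
import time


def flush_in_batches(fd, pages, batch_pages=64, pause_s=0.0):
    """Write (offset, data) pages to fd, calling fsync after every
    batch_pages pages. Returns the number of fsync calls issued.

    Hypothetical illustration only: batch_pages and pause_s are
    made-up knobs, not existing settings.
    """
    syncs = 0
    pending = 0
    for offset, data in pages:
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, data)
        pending += 1
        if pending >= batch_pages:
            # Bound the amount of dirty data a single fsync must flush.
            os.fsync(fd)
            syncs += 1
            pending = 0
            if pause_s > 0:
                # Give the disk (or controller cache) time to drain.
                time.sleep(pause_s)
    if pending:
        # Flush whatever is left in the final, partial batch.
        os.fsync(fd)
        syncs += 1
    return syncs
```

With 10 pages and batch_pages=4 this issues three fsyncs (4 + 4 + 2 pages)
instead of one large one. Whether smaller, more frequent fsyncs actually
help depends on the filesystem and controller, so treat the numbers as
placeholders.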

As for settings, I really like the simplicity of the Oracle system...
"Just try to ensure recovery takes about X seconds". I like
the idea of a creeping checkpoint even more; only writing a buffer out
when we need to checkpoint it makes a lot more sense to me than trying
to guess when we'll next dirty a buffer. Such a system would probably
also be a lot easier to tune than the current bgwriter, even if we
couldn't simplify it all the way to "seconds for recovery".
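
A rough sketch of what I mean by a creeping checkpoint driven by a recovery
target, under stated assumptions: LSNs are byte positions in WAL, replay
speed is a known constant, and each page remembers the LSN at which it was
first dirtied. The class CreepingCheckpointer and the parameters
recovery_target_s and replay_bytes_per_s are hypothetical names for
illustration, not anything in the tree.

```python
from collections import OrderedDict


class CreepingCheckpointer:
    """Tracks dirty pages by the LSN at which each was first dirtied and
    reports which ones must be written to keep redo within a budget."""

    def __init__(self, recovery_target_s, replay_bytes_per_s):
        # Maximum WAL distance we can afford to replay after a crash.
        self.max_redo_bytes = recovery_target_s * replay_bytes_per_s
        # page_id -> first-dirtied LSN, kept in first-dirtied order
        # (assumes mark_dirty is called with non-decreasing LSNs).
        self.dirty = OrderedDict()

    def mark_dirty(self, page_id, lsn):
        # Keep the earliest LSN: recovery has to start replay from there.
        self.dirty.setdefault(page_id, lsn)

    def pages_to_write(self, current_lsn):
        """Remove and return the oldest dirty pages whose redo distance
        exceeds the budget; younger pages are left alone."""
        out = []
        while self.dirty:
            page_id, first_lsn = next(iter(self.dirty.items()))
            if current_lsn - first_lsn <= self.max_redo_bytes:
                break
            del self.dirty[page_id]
            out.append(page_id)
        return out
```

The only tunable a user would see here is the recovery target; a buffer is
written only once leaving it dirty would push recovery past that target.
The constant replay rate is the weak point of the sketch, since real
replay speed varies with workload.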
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
