Re: Spread checkpoint sync

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spread checkpoint sync
Date: 2011-02-01 17:44:05
Message-ID: AANLkTi=CbZMgxg=SX=H=Vpe=SCTT03OopQEhvJbfGEhd@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 31, 2011 at 4:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Back to the idea at hand - I proposed something a bit along these
>> lines upthread, but my idea was to proactively perform the fsyncs on
>> the relations that had gone the longest without a write, rather than
>> the ones with the most dirty data.
>
> Yeah.  What I meant to suggest, but evidently didn't explain well, was
> to use that or something much like it as the rule for deciding *what* to
> fsync next, but to use amount-of-unsynced-data-versus-threshold as the
> method for deciding *when* to do the next fsync.

Oh, I see. Yeah, that could be a good algorithm.
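
To make the what/when split concrete, here is a rough sketch in Python.
It is not PostgreSQL code: SyncScheduler, RelState, threshold_bytes and
the method names are all made up for illustration, and the real thing
would have to track this per segment file in the fsync request queue.
The only point is that *when* is decided by total unsynced data versus a
threshold, and *what* is decided by which relation has gone longest
without a write.

```python
from dataclasses import dataclass


@dataclass
class RelState:
    """Hypothetical per-relation bookkeeping (not a PostgreSQL structure)."""
    name: str
    unsynced_bytes: int = 0
    last_write: float = 0.0  # time of the most recent write to this relation


class SyncScheduler:
    """Toy model of the 'what' versus 'when' fsync rule."""

    def __init__(self, threshold_bytes):
        self.threshold_bytes = threshold_bytes
        self.rels = {}

    def note_write(self, name, nbytes, now):
        # Record that nbytes were written (but not yet fsync'd) to a relation.
        rel = self.rels.setdefault(name, RelState(name))
        rel.unsynced_bytes += nbytes
        rel.last_write = now

    def total_unsynced(self):
        return sum(r.unsynced_bytes for r in self.rels.values())

    def next_to_sync(self):
        """Return the relation to fsync next, or None if no fsync is due."""
        # WHEN: only act once unsynced data exceeds the threshold.
        if self.total_unsynced() <= self.threshold_bytes:
            return None
        dirty = [r for r in self.rels.values() if r.unsynced_bytes > 0]
        if not dirty:
            return None
        # WHAT: pick the relation that has gone longest without a write.
        return min(dirty, key=lambda r: r.last_write).name

    def mark_synced(self, name):
        # Called after the (real) fsync of this relation has completed.
        self.rels[name].unsynced_bytes = 0
```

With a threshold of 100 bytes, writing 60 bytes to "a" and then 60 bytes
to "b" would make next_to_sync() return "a", since "a" has been idle
longest; after mark_synced("a") the total drops back under the threshold
and nothing further is due.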

I also think Bruce's idea of calling fsync() on each relation just
*before* we start writing the pages from that relation might have some
merit. (I'm assuming here that we are sorting the writes.) That
should tend to result in the end-of-checkpoint fsyncs being quite
fast, because we'll only have as much dirty data floating around as we
actually wrote during the checkpoint, which according to Greg Smith is
usually a small fraction of the total data in need of flushing. Also,
if one of the pre-write fsyncs takes a long time, then that'll get
factored into our calculations of how fast we need to write the
remaining data to finish the checkpoint on schedule. Of course,
there's still the possibility that the I/O system literally can't
finish a checkpoint in X minutes, but even in that case, the I/O
saturation will hopefully be more spread out across the entire
checkpoint instead of falling like a hammer at the very end.
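
Roughly, the ordering of operations I have in mind looks like the
sketch below. Again this is illustrative Python, not a patch:
checkpoint_plan and the (relation, block) tuples are invented for the
example, it only produces the order of operations rather than doing any
I/O, and it assumes the writes are sorted by relation as mentioned
above. Pacing (how a slow pre-write fsync feeds back into the write
rate) is deliberately left out.

```python
from itertools import groupby


def checkpoint_plan(dirty_pages):
    """Return the ordered list of operations for one checkpoint.

    dirty_pages is an iterable of (relation, block_number) pairs.
    Each relation is fsync'd just before its pages are written, and
    all relations are fsync'd again at the end of the checkpoint.
    """
    ops = []
    rels = []
    # Sort the writes so that each relation's pages are contiguous.
    ordered = sorted(set(dirty_pages))
    for rel, pages in groupby(ordered, key=lambda p: p[0]):
        # Pre-write fsync: flush whatever the OS already holds for rel.
        ops.append(("fsync", rel))
        rels.append(rel)
        for _, block in pages:
            ops.append(("write", rel, block))
    # End-of-checkpoint fsyncs: only the data written above should
    # still be dirty for these relations.
    for rel in rels:
        ops.append(("fsync", rel))
    return ops
```

For dirty pages ("t2", 5), ("t1", 3), ("t1", 1), the plan is: fsync t1,
write t1 blocks 1 and 3, fsync t2, write t2 block 5, then the final
fsyncs of t1 and t2.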

Back to your idea: One problem with trying to bound the unflushed data
is that it's not clear what the bound should be. I've had this mental
model where we want the OS to write out pages to disk, but that's not
always true, per Greg Smith's recent posts about Linux kernel tuning
slowing down VACUUM. A possible advantage of the Momjian algorithm
(as it's known in the literature) is that we don't actually start
forcing anything out to disk until we have a reason to do so - namely,
an impending checkpoint.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
