Re: Spread checkpoint sync

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spread checkpoint sync
Date: 2011-02-01 18:32:28
Message-ID: AANLkTimgvaV-sFOd6Ces4YXK_ZzEuJ=p3GkyEKubVjPH@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 1, 2011 at 12:58 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> I also think Bruce's idea of calling fsync() on each relation just
>> *before* we start writing the pages from that relation might have
>> some merit.
>
> What bothers me about that is that you may have a lot of the same
> dirty pages in the OS cache as the PostgreSQL cache, and you've just
> ensured that the OS will write those *twice*.  I'm pretty sure that
> the reason the aggressive background writer settings we use have not
> caused any noticeable increase in OS disk writes is that many
> PostgreSQL writes of the same buffer keep an OS buffer page from
> becoming stale enough to get flushed until PostgreSQL writes to it
> taper off.  Calling fsync() right before doing "one last push" of
> the data could be really pessimal for some workloads.

I was thinking about what Greg reported here:

http://archives.postgresql.org/pgsql-hackers/2010-11/msg01387.php

If the amount of pre-checkpoint dirty data is 3GB and the checkpoint
is writing 250MB, then you shouldn't have all that many extra
writes... but you might have some, and that might be enough to send
the whole thing down the tubes.
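A rough way to put a number on "not all that many", reading the 3GB as data already dirty in the OS cache (that reading, and the function name below, are my assumptions for illustration): a page only gets written twice if the early fsync() flushes it and the checkpoint then writes it again, so the overlap is bounded by the smaller of the two amounts.

```python
def max_extra_write_fraction(os_dirty_mb: float, checkpoint_mb: float) -> float:
    """Upper bound on extra writes, as a fraction of the data already dirty in the OS cache."""
    # A page is written twice only if it was dirty in the OS cache when the
    # early fsync() ran AND the checkpoint writes it again afterwards, so the
    # overlap cannot exceed either amount.
    worst_case_overlap_mb = min(os_dirty_mb, checkpoint_mb)
    return worst_case_overlap_mb / os_dirty_mb


# Figures from the paragraph above: 3GB of dirty data, 250MB checkpoint.
fraction = max_extra_write_fraction(3 * 1024, 250)
print(f"worst-case extra writes: {fraction:.1%}")
```

With those figures the bound comes out at about 8%, and the real overlap could be well below that.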

InnoDB apparently handles this problem by advancing the redo pointer
in small steps instead of in large jumps. AIUI, in addition to
tracking the LSN of each page, they also track the first-dirtied LSN.
That lets you checkpoint to an arbitrary LSN by flushing just the
pages with an older first-dirtied LSN. So instead of doing a
checkpoint every hour, you might do a mini-checkpoint every 10
minutes. Since the mini-checkpoints each need to flush less data,
they should be less disruptive than a full checkpoint. But that, too,
will generate some extra writes. Basically, any idea that involves
calling fsync() more often is going to tend to smooth out the I/O load
at the cost of some increase in the total number of writes.
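To make the first-dirtied-LSN idea concrete, here is a minimal sketch of my understanding of it. It is not InnoDB's or PostgreSQL's actual code; the class and method names (`Page`, `BufferPool`, `modify`, `mini_checkpoint`) are made up for illustration, and LSNs are plain integers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Page:
    page_lsn: int = 0                         # LSN of the latest change to the page
    first_dirtied_lsn: Optional[int] = None   # LSN that first dirtied it; None if clean


@dataclass
class BufferPool:
    pages: Dict[int, Page] = field(default_factory=dict)
    redo_pointer: int = 0   # recovery would start replaying from here
    writes: int = 0         # page flushes so far, to make extra writes visible

    def modify(self, page_id: int, lsn: int) -> None:
        """Record a change to a page at the given LSN."""
        page = self.pages.setdefault(page_id, Page())
        if page.first_dirtied_lsn is None:
            page.first_dirtied_lsn = lsn   # only set when a clean page turns dirty
        page.page_lsn = lsn

    def mini_checkpoint(self, target_lsn: int) -> List[int]:
        """Flush only pages first dirtied before target_lsn, then advance the redo pointer."""
        flushed = []
        for page_id, page in sorted(self.pages.items()):
            if page.first_dirtied_lsn is not None and page.first_dirtied_lsn < target_lsn:
                page.first_dirtied_lsn = None   # page is clean again after the flush
                self.writes += 1
                flushed.append(page_id)
        self.redo_pointer = max(self.redo_pointer, target_lsn)
        return flushed


pool = BufferPool()
pool.modify(1, 10)
pool.modify(2, 20)
pool.modify(1, 30)
print(pool.mini_checkpoint(15))   # only page 1 was first dirtied before LSN 15
pool.modify(1, 40)                # page 1 gets dirtied again
print(pool.mini_checkpoint(45))   # now both pages need flushing
print(pool.writes)                # page 1 was written twice: the extra write
```

In this toy run a single checkpoint at LSN 45 would have written two pages; the two mini-checkpoints write three, which is the trade-off described above.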

If we don't want any increase at all in the number of writes,
spreading out the fsync() calls is pretty much the only other option.
I'm worried that even with good tuning that won't be enough to tamp
down the latency spikes. But maybe it will be...
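For comparison, the spread-out-the-fsyncs approach amounts to something like the following sketch. This is only an illustration of the idea, not the patch under discussion; `spread_fsync`, its parameters, and the fixed pause are assumptions, and real tuning would presumably adapt the pause to how long each sync takes.

```python
import os
import time
from typing import Callable, Iterable


def spread_fsync(paths: Iterable[str], pause_seconds: float,
                 sleep: Callable[[float], None] = time.sleep) -> int:
    """fsync each file in turn, pausing between calls; return the number synced."""
    synced = 0
    for path in paths:
        if synced > 0:
            sleep(pause_seconds)   # let the I/O system drain before the next sync
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)           # same data gets flushed, just not all at once
        finally:
            os.close(fd)
        synced += 1
    return synced
```

The total amount written is unchanged; only the timing of the flushes moves, which is why it is the one option that adds no writes.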

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
