Re: Spread checkpoint sync

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spread checkpoint sync
Date: 2011-01-16 07:28:58
Message-ID: 4D329E3A.2030803@2ndquadrant.com
Lists: pgsql-hackers

Robert Haas wrote:
> What is the basis for thinking that the sync should get the same
> amount of time as the writes? That seems pretty arbitrary. Right
> now, you're allowing 3 seconds per fsync, which could be a lot more or
> a lot less than 40% of the total checkpoint time...

Just that it's where I ended up after fighting with this for a month
on the system where I've seen the most problems. The 3-second number was
reverse-engineered from a computation that said "aim for an interval of
X minutes; we have Y relations on average involved in the checkpoint".
The direction my latest patch is struggling to go is computing a
reasonable time automatically in the same way: count the relations, do a
time estimate, and add enough delay that the sync calls are spread
linearly over the given time range.
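A hedged sketch of that arithmetic (the figures here are hypothetical
illustrations, not the patch's actual defaults):

```shell
# Derive a per-fsync delay from a target sync window and a relation
# count; illustrative numbers only, not the patch code.
sync_window_secs=120   # aim to spread the sync phase over 2 minutes
relations=40           # relations with dirty data in this checkpoint
delay_ms=$(( sync_window_secs * 1000 / relations ))
echo "${delay_ms}ms between fsync calls"   # 3000ms, the 3-second figure
```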

> the checkpoint activity is always going to be spikey if it does
> anything at all, so spacing it out *more* isn't obviously useful.
>

One of the components of the write queue is some notion that writes
that have been waiting longest should eventually be flushed out. Linux
has a tunable called dirty_expire_centisecs which suggests it enforces
just that, set to a default of 30 seconds. This is why checkpoints with
default parameters, a 5-minute interval effectively spreading the writes
over 2.5 minutes, can work under the current design.
Anything you wrote at T+0 to T+2:00 *should* have been written out
already when you reach T+2:30 and sync. Unfortunately, when the system
gets busy, there is this "congestion control" logic that basically
throws out any guarantee of writes starting shortly after the expiration
time.
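The timeline above can be checked against the stock values (a sketch;
checkpoint_completion_target = 0.5 and the usual Linux default for
dirty_expire_centisecs are assumed):

```shell
# Stock values assumed: 5-minute checkpoints spread over half the
# interval, plus the default 30-second writeback expiry.
checkpoint_timeout=300        # seconds
completion_target_pct=50      # checkpoint_completion_target = 0.5
dirty_expire_centisecs=3000   # Linux default: 30 seconds

sync_start=$(( checkpoint_timeout * completion_target_pct / 100 ))
expire_secs=$(( dirty_expire_centisecs / 100 ))
# Anything written more than expire_secs before the sync phase starts
# should already be on disk when the fsync calls begin:
echo "writes before T+$(( sync_start - expire_secs ))s flushed by T+${sync_start}s"
```

That reproduces the T+2:00 / T+2:30 numbers in the paragraph above,
assuming the expiry guarantee actually holds.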

It turns out that the only tunables that really work are the ones that
block new writes from happening once the queue is full, but on earlier
kernels they can't be set low enough to work well when combined with
lots of RAM. Using the terminology of
http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt you
eventually hit a point where "a process generating disk writes will
itself start writeback." This is analogous to the PostgreSQL situation
where
backends do their own fsync calls. The kernel will eventually move to
where those trying to write new data are instead recruited into being
additional sources of write flushing. That's the part you just can't
make aggressive enough on older kernels; dirty writers can always win.
Ideally, the system never digs itself into a hole larger than you can
afford to wait to write out. It's a transaction speed vs. latency thing
though, and the older kernels just don't consider the latency side well
enough.
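For reference, the ratio-based knobs under discussion are
vm.dirty_background_ratio and vm.dirty_ratio; here is what commonly
cited old-kernel defaults translate to on an 8GB machine (defaults vary
by kernel version, so treat the numbers as illustrative):

```shell
ram_mb=8192                  # 8GB box
dirty_background_ratio=10    # background flushing starts here (common default)
dirty_ratio=20               # dirty writers get recruited into writeback here

echo "background writeback above $(( ram_mb * dirty_background_ratio / 100 ))MB dirty"
echo "writers throttled above $(( ram_mb * dirty_ratio / 100 ))MB dirty"
```

With hundreds of megabytes allowed to sit dirty before anyone is
throttled, it's easy to see how the backlog outgrows what you can
afford to sync.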

There is a new mechanism in the latest kernels to control this much
better: dirty_bytes and dirty_background_bytes are the tunables. I
haven't had a chance to test them yet. As mentioned upthread, some of
the bleeding-edge kernels that have this feature available are showing
such large general performance regressions in our tests, compared to the
boring old RHEL5 kernel, that whether this feature works or not is
irrelevant. I haven't yet tracked down which of the newer kernel
distributions work well for PostgreSQL performance-wise and which don't.
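Once a kernel with those tunables is in hand, setting them looks
something like this (a hedged example; the values are illustrative, not
a recommendation):

```shell
# Byte-based writeback limits, available on newer kernels only.
dirty_bytes=$(( 15 * 1024 * 1024 ))             # 15MB hard limit on dirty data
dirty_background_bytes=$(( 5 * 1024 * 1024 ))   # start background flushing early
echo "$dirty_bytes $dirty_background_bytes"
# To apply (as root): sysctl -w vm.dirty_bytes=$dirty_bytes
#                     sysctl -w vm.dirty_background_bytes=$dirty_background_bytes
```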

I'm hoping that when I get there, I'll see results like
http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages,
where the ideal setting for dirty_bytes to keep latency under control
with BBWC was 15MB. To put that into perspective, the lowest useful
setting you can give dirty_ratio is 5% of RAM. That's 410MB on my
measly 8GB desktop, and about 3.2GB on the 64GB production server I've
been trying to tune.
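The arithmetic behind those two figures, as a quick check (rounded to
the nearest MB):

```shell
ratio=5   # lowest useful dirty_ratio setting, per the text above
for ram_gb in 8 64; do
  mb=$(( (ram_gb * 1024 * ratio + 50) / 100 ))   # 5% of RAM, rounded to MB
  echo "${ram_gb}GB RAM -> ${mb}MB of dirty data allowed"
done
```

Either way, orders of magnitude more than the 15MB that kept latency
under control in the thread above.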

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
