Re: Controlling Load Distributed Checkpoints

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-11 07:51:51
Message-ID: Pine.GSO.4.64.0706110316020.9600@westnet.com
Lists: pgsql-hackers pgsql-patches

On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:

> If the kernel can treat sequential writes better than random writes, is
> it worth sorting dirty buffers in block order per file at the start of
> checkpoints?

I think it has the potential to improve things. I can think of three
obvious arguments against it, and one subtle one:

1) Extra complexity for something that may not help. It would take good,
robust benchmark results showing an improvement to justify its use.

2) Block number ordering may not reflect actual order on disk. While
true, it's got to be better correlated with it than writing at random.

3) The OS disk elevator should be dealing with this issue, particularly
because it may really know the actual disk ordering.

Here's the subtle thing: by writing in the same order the LRU scan occurs
in, you are writing dirty buffers in the optimal fashion to eliminate
client backend writes during BufferAlloc. This makes the checkpoint a
really effective LRU clearing mechanism. Writing in block order will
change that.
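
To make the trade-off concrete, here is a toy sketch in Python. It is not
PostgreSQL code: the DirtyBuffer record, its field names, and the sample
data are all hypothetical. It compares the two write orders by counting
how many separate sequential runs each one hands to the kernel.

```python
from typing import NamedTuple


class DirtyBuffer(NamedTuple):
    # All three fields are hypothetical, for illustration only.
    relfile: str  # which file the buffer belongs to
    block: int    # block number within that file
    lru_pos: int  # position in the LRU scan (0 = next to be reused)


def lru_order(buffers):
    """Write order that follows the LRU scan."""
    return sorted(buffers, key=lambda b: b.lru_pos)


def block_order(buffers):
    """Write order sorted by block number within each file."""
    return sorted(buffers, key=lambda b: (b.relfile, b.block))


def sequential_runs(ordered):
    """Count runs of consecutive blocks within the same file."""
    runs = 0
    prev = None
    for buf in ordered:
        contiguous = (
            prev is not None
            and buf.relfile == prev.relfile
            and buf.block == prev.block + 1
        )
        if not contiguous:
            runs += 1
        prev = buf
    return runs


# Made-up sample: five dirty buffers spread over two files.
dirty = [
    DirtyBuffer("rel_a", 7, 0),
    DirtyBuffer("rel_b", 2, 1),
    DirtyBuffer("rel_a", 5, 2),
    DirtyBuffer("rel_b", 3, 3),
    DirtyBuffer("rel_a", 6, 4),
]

runs_lru = sequential_runs(lru_order(dirty))
runs_sorted = sequential_runs(block_order(dirty))
# Where the buffer the LRU scan needs first ends up after sorting.
first_needed = block_order(dirty).index(lru_order(dirty)[0])
```

On this sample the sorted order produces fewer runs, but the buffer that
the LRU scan will want first is no longer the first one written, which is
the cost described above. Whether the gain outweighs that cost is exactly
what benchmarking would have to show.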

I spent some time trying to optimize the elevator part of this operation,
since I knew that on the system I was using block order was actual order.
I found that under Linux, the behavior of the pdflush daemon that manages
dirty memory had a more serious impact on writing behavior at checkpoint
time than playing with the elevator scheduling method did. The way
pdflush works actually has several interesting implications for how to
optimize this patch. For example, how writes get blocked when the dirty
memory reaches certain thresholds means that you may not get the full
benefit of the disk elevator at checkpoint time the way most would expect.

Since much of that was basically undocumented, I had to write my own
analysis of the actual workings, which is now available at
http://www.westnet.com/~gsmith/content/linux-pdflush.htm. Anyone who
wants more information about how Linux kernel parameters like
dirty_background_ratio actually work, and how they impact the writing
strategy, should find that article helpful.
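
As a rough illustration of the thresholds involved, here is a small Python
sketch. It is a simplification under stated assumptions, not the kernel's
exact arithmetic: it treats each ratio as a plain percentage of some
"dirtyable" memory figure that you supply. dirty_ratio is a second tunable
in /proc/sys/vm that I have not named above, and the function names are
made up for this example.

```python
def dirty_thresholds(dirtyable_bytes, background_ratio, dirty_ratio):
    """Return (background_bytes, blocking_bytes) for the given ratios.

    Simplified model: each threshold is a percentage of dirtyable_bytes.
    Above the first, background writeback starts; above the second,
    processes that write are made to wait.
    """
    background = dirtyable_bytes * background_ratio // 100
    blocking = dirtyable_bytes * dirty_ratio // 100
    return background, blocking


def read_vm_ratio(name, base="/proc/sys/vm"):
    """Read an integer tunable such as dirty_background_ratio (Linux only)."""
    with open(f"{base}/{name}") as f:
        return int(f.read().strip())


# Illustrative numbers only, not defaults from any particular kernel.
example = dirty_thresholds(8 * 1024**3, 10, 40)
```

On a Linux box, read_vm_ratio("dirty_background_ratio") and
read_vm_ratio("dirty_ratio") give the current settings to plug in. The
point is that a checkpoint writing a lot of data can push dirty memory
past the second threshold well before the elevator gets to reorder much.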

> Some kernels or storage subsystems treat all I/Os too fairly so that
> user transactions waiting for reads are blocked by checkpoints writes.

In addition to that (which I've seen happen quite a bit), in the Linux
case another fairness issue is that the code that handles writes allows a
single process writing a lot of data to block writes for everyone else.
That means that in addition to being blocked on actual reads, if a client
backend starts a write in order to complete a buffer allocation to hold
new information, that can grind to a halt because of the checkpoint
process as well.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
