On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> I've attached an updated version of the initial sync spreading patch here,
> one that applies cleanly on top of HEAD and over top of the sync
> instrumentation patch too. The conflict that made that hard before is gone
With the fsync queue compaction patch applied, I think most of this is
now not needed. Attached please find an attempt to isolate the
portion that looks like it might still be useful. The basic idea of
what remains here is to make the background writer still do its normal
stuff even when it's checkpointing. In particular, with this patch
applied, PG will:
1. Absorb fsync requests a lot more often during the sync phase.
2. Still try to run the cleaning scan during the sync phase.
3. Pause for 3 seconds after every fsync.
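To make the ordering of those three behaviors concrete, here is a rough sketch of the sync-phase loop. It is not the patch itself: every function and parameter name below (sync_phase, fsync_file, absorb_fsync_requests, run_cleaning_scan, pause_seconds) is a hypothetical stand-in, and the real code may interleave the work differently.

```python
from typing import Callable, Iterable, List


def sync_phase(
    pending_files: Iterable[str],
    fsync_file: Callable[[str], None],
    absorb_fsync_requests: Callable[[], None],
    run_cleaning_scan: Callable[[], None],
    sleep: Callable[[float], None],
    pause_seconds: float = 3.0,
) -> List[str]:
    """Sync each pending file while still doing normal background-writer work.

    All callbacks are hypothetical stand-ins, not PostgreSQL functions.
    """
    synced = []
    for path in pending_files:
        absorb_fsync_requests()  # (1) drain the fsync request queue often
        fsync_file(path)
        synced.append(path)
        run_cleaning_scan()      # (2) keep producing clean buffers
        sleep(pause_seconds)     # (3) fixed pause after every fsync
    return synced
```

Passing the callbacks in keeps the sketch runnable without a server; in practice sleep would be time.sleep and the other callbacks would do real I/O.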
I suspect that #1 is probably a good idea. It seems pretty clear
based on your previous testing that the fsync compaction patch should
be sufficient to prevent us from hitting the wall, but if we're going
to do any kind of nontrivial work here then cleaning the queue is a
sensible thing to do along the way, and there's little downside.
I also suspect #2 is a good idea. The fact that we're checkpointing
doesn't mean that the system suddenly doesn't require clean buffers,
and the experimentation I've done recently (see: limiting hint bit
I/O) convinces me that it's pretty expensive from a performance
standpoint when backends have to start writing out their own buffers,
so continuing to do that work during the sync phase of a checkpoint,
just as we do during the write phase, seems pretty sensible.
I think something along the lines of #3 is probably a good idea, but
the current coding doesn't take checkpoint_completion_target into
account. The underlying problem here is that it's at least somewhat
reasonable to assume that if we write() a whole bunch of blocks, each
write() will take approximately the same amount of time. But this is
not true at all with respect to fsync() - they neither take the same
amount of time as each other, nor is there any fixed ratio between
write() time and fsync() time to go by. So if we want the checkpoint
to finish in, say, 20 minutes, we can't know whether the write phase
needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.
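As a back-of-the-envelope illustration of that uncertainty, the snippet below works out the write-phase deadline for a 20-minute checkpoint under a few guessed sync-phase durations. The function name and the sync durations are made up for illustration; the whole point is that we have no way to know the real sync duration in advance.

```python
def write_phase_deadline(checkpoint_seconds: float, sync_seconds: float) -> str:
    """Latest mm:ss by which writes must finish, for an assumed sync duration."""
    remaining = max(checkpoint_seconds - sync_seconds, 0)
    minutes, seconds = divmod(int(remaining), 60)
    return f"{minutes}:{seconds:02d}"


# Hypothetical sync-phase durations, in seconds, for a 20-minute checkpoint.
for sync_seconds in (600, 300, 240, 60, 1):
    print(sync_seconds, write_phase_deadline(20 * 60, sync_seconds))
```

The five guesses correspond to the deadlines mentioned above: minute 10, 15, 16, 19, and 19:59.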
One idea I have is to try to get some of the fsyncs out of the queue
at times other than end-of-checkpoint. Even if this resulted in some
modest increase in the total number of fsync() calls, it might improve
performance by causing data to be flushed to disk in smaller chunks.
For example, suppose we kept an LRU list of pending fsync requests -
every time we remember an fsync request for a particular relation, we
move it to the head (hot end) of the LRU. And periodically we pull
the tail entry off the list and fsync it - say, after
checkpoint_timeout / (# of items in the list). That way, when we
arrive at the end of the checkpoint and start syncing everything,
the syncs hopefully complete more quickly because we've already forced
a bunch of the data down to disk. That algorithm may well be too
stupid or just not work in real life, but perhaps there's some
variation that would be sensible. The point is: instead of or in
addition to trying to spread out the sync phase, we might want to
investigate whether it's possible to reduce its size.
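Here is a minimal sketch of that LRU idea, just to pin down what I mean. It is a toy model, not proposed code: the class and method names are invented, fsync_relation is a hypothetical callback standing in for the real sync, and the caller is assumed to invoke sync_coldest() once per interval().

```python
from collections import OrderedDict
from typing import Callable, List, Optional


class PendingFsyncLRU:
    """Toy LRU of relations with pending fsync requests (hypothetical names)."""

    def __init__(self, checkpoint_timeout: float,
                 fsync_relation: Callable[[str], None]) -> None:
        self.checkpoint_timeout = checkpoint_timeout
        self.fsync_relation = fsync_relation
        # First key is the cold tail, last key is the hot head.
        self.pending: "OrderedDict[str, bool]" = OrderedDict()

    def remember(self, relation: str) -> None:
        """Record a request and move the relation to the hot end."""
        self.pending[relation] = True
        self.pending.move_to_end(relation)

    def interval(self) -> Optional[float]:
        """Seconds between background syncs: timeout / (# of items)."""
        if not self.pending:
            return None
        return self.checkpoint_timeout / len(self.pending)

    def sync_coldest(self) -> Optional[str]:
        """Pull the tail entry off the list and fsync it."""
        if not self.pending:
            return None
        relation, _ = self.pending.popitem(last=False)
        self.fsync_relation(relation)
        return relation

    def sync_all(self) -> List[str]:
        """End-of-checkpoint: sync whatever is still pending, coldest first."""
        synced = []
        while self.pending:
            synced.append(self.sync_coldest())
        return synced
```

A relation that keeps getting dirtied stays near the hot end and is left for the end-of-checkpoint pass, while relations that have gone quiet get flushed early, which is where the hoped-for reduction in the final sync would come from.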