Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-06-25 20:28:12
Message-ID: 51C9FD5C.5050706@vmware.com
Lists: pgsql-hackers

On 25.06.2013 23:03, Robert Haas wrote:
> On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> I'm not sure it's a good idea to sleep proportionally to the time it took to
>> complete the previous fsync. If you have a 1GB cache in the RAID controller,
>> fsyncing a 1GB segment will fill it up. But since it fits in cache, it
>> will return immediately. So we proceed fsyncing other files, until the cache
>> is full and the fsync blocks. But once we fill up the cache, it's likely
>> that we're hurting concurrent queries. ISTM it would be better to stay under
>> that threshold, keeping the I/O system busy, but never fill up the cache
>> completely.
>
> Isn't the behavior implemented by the patch a reasonable approximation
> of just that? When the fsyncs start to get slow, that's when we start
> to sleep. I'll grant that it would be better to sleep when the
> fsyncs are *about* to get slow, rather than when they actually have
> become slow, but we have no way to know that.

Well, that's the point I was trying to make: you should sleep *before*
the fsyncs get slow.
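
Just to make that concrete: the patch's feedback loop, as I read it,
amounts to roughly the following (a hypothetical sketch in my own
words, not the patch's actual code):

#include <time.h>
#include <unistd.h>

/*
 * Sleep after each fsync, in proportion to how long it took.
 */
static void
fsync_with_feedback(int fd)
{
    struct timespec start, end;
    long        elapsed_us;

    clock_gettime(CLOCK_MONOTONIC, &start);
    fsync(fd);
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_us = (end.tv_sec - start.tv_sec) * 1000000L +
                 (end.tv_nsec - start.tv_nsec) / 1000;

    /* Back off in proportion to the observed fsync latency. */
    usleep((useconds_t) elapsed_us);
}

The catch is that elapsed_us stays near zero for as long as the
controller cache keeps absorbing the writes, so this loop only starts
backing off once the cache is already full.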

> The only feedback we have on how bad things are is how long it took
> the last fsync to complete, so I actually think that's a much better
> way to go than any fixed sleep - which will often be unnecessarily
> long on a well-behaved system, and which will often be far too short
> on one that's having trouble. I'm inclined to think Kondo-san
> has got it right.

Quite possible, I really don't know. I'm inclined to first try the
simplest thing possible, and only make it more complicated if that's not
good enough. Kondo-san's patch wasn't very complicated, but nevertheless
a fixed sleep between every fsync, unless you're behind the schedule, is
even simpler. In particular, it's easier to tie that into the checkpoint
scheduler - I'm not sure how you'd measure progress or determine how
long to sleep unless you assume that every fsync is the same.
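
Roughly what I have in mind, as a sketch only (IsCheckpointOnSchedule()
is the existing scheduler test in checkpointer.c, stubbed out here so
the fragment stands alone; the fixed 100 ms pause is an arbitrary
placeholder):

#include <stdbool.h>
#include <unistd.h>

/* Stub: the real scheduler test lives in checkpointer.c. */
static bool
IsCheckpointOnSchedule(double progress)
{
    (void) progress;
    return true;
}

/*
 * Fsync each file, pausing for a fixed interval between fsyncs as
 * long as we're on schedule, treating each fsync as an equal share
 * of the phase's progress.
 */
static void
fsync_files_paced(const int *fds, int nfds)
{
    int         i;

    for (i = 0; i < nfds; i++)
    {
        fsync(fds[i]);

        if (IsCheckpointOnSchedule((double) (i + 1) / nfds))
            usleep(100000);     /* 100 ms; value arbitrary */
    }
}

Treating every fsync as an equal share makes the progress term
trivial to compute; whether that assumption holds in practice is
exactly the open question.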

> I like your idea of putting a stake in the ground and assuming that
> the fsync phase will turn out to be X% of the checkpoint, but I wonder
> if we can be a bit more sophisticated, especially for cases where
> checkpoint_segments is small. When checkpoint_segments is large, then
> we know that some of the data will get written back to disk during the
> write phase, because the OS cache is only so big. But when it's
> small, the OS will essentially do nothing during the write phase, and
> then it's got to write all the data out during the fsync phase. I'm
> not sure we can really model that effect thoroughly, but even
> something dumb would be smarter than what we have now - e.g. use 10%,
> but when checkpoint_segments < 10, use 1/checkpoint_segments. Or just
> assume the fsync phase will take 30 seconds.

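For concreteness, I read that heuristic as something like this
(hypothetical sketch, not code from any patch):

/*
 * Fraction of the checkpoint interval to reserve for the fsync
 * phase: 10% normally, but 1/checkpoint_segments when
 * checkpoint_segments is small.
 */
static double
fsync_phase_fraction(int checkpoint_segments)
{
    if (checkpoint_segments < 10)
        return 1.0 / checkpoint_segments;

    return 0.10;
}
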
If checkpoint_segments < 10, there isn't very much dirty data to flush
out, so that case isn't really a problem: no matter how stupidly we do
the writing and fsyncing, the I/O cache can absorb it, and it doesn't
much matter what we do.

- Heikki
