Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-06-27 15:08:59
Message-ID: CA+TgmoZj-0CFLbjceMtgi1t_=hpfe9z8uSa+D0g77sNYcFy04w@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 25, 2013 at 4:28 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
>> The only feedback we have on how bad things are is how long it took
>> the last fsync to complete, so I actually think that's a much better
>> way to go than any fixed sleep - which will often be unnecessarily
>> long on a well-behaved system, and which will often be far too short
>> on one that's having trouble. I'm inclined to think think Kondo-san
>> has got it right.
>
> Quite possible, I really don't know. I'm inclined to first try the simplest
> thing possible, and only make it more complicated if that's not good enough.
> Kondo-san's patch wasn't very complicated, but nevertheless a fixed sleep
> between every fsync, unless you're behind the schedule, is even simpler.

I'm pretty sure Greg Smith tried the fixed-sleep thing before and
it didn't work that well. I have also tried it and the resulting
behavior was unimpressive. It makes checkpoints take a long time to
complete even when there's very little data to flush out to the OS,
which is annoying; and when things actually do get ugly, the sleeps
aren't long enough to matter. See the timings Kondo-san posted
downthread: 100 ms delays aren't going to let the system recover in any
useful way when the fsync can take 13 s for one file. On a system
that's badly weighed down by I/O, the fsync times are often
*extremely* long - 13 s is far from the worst you can see. You have
to give the system a meaningful time to recover from that, allowing
other processes to make meaningful progress before you hit it again,
or system performance just goes down the tubes. Greg's test, IIRC,
used 3 s sleeps rather than your proposal of 100 ms, but it still
wasn't enough.
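
To make the contrast concrete, the fixed-sleep approach amounts to
roughly the sketch below. This is purely illustrative, not the
checkpointer's actual pending-ops machinery; the file list and helper
names are made up.

    #include <unistd.h>

    /*
     * Illustrative only: fsync each pending file, pausing for a fixed
     * interval in between.  The pause is the same whether the fsync
     * took 5 ms or 13 s, which is exactly the problem.
     */
    static void
    fsync_phase_fixed_sleep(const int *pending_fds, int nfiles,
                            unsigned int sleep_usec)
    {
        int     i;

        for (i = 0; i < nfiles; i++)
        {
            (void) fsync(pending_fds[i]);

            if (i + 1 < nfiles)
                usleep(sleep_usec);     /* e.g. 100000 for 100 ms */
        }
    }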

> In
> particular, it's easier to tie that into the checkpoint scheduler - I'm not
> sure how you'd measure progress or determine how long to sleep unless you
> assume that every fsync is the same.

I think the thing to do is assume that the fsync phase will take 10%
or so of the total checkpoint time, but then be prepared to let the
checkpoint run a bit longer if the fsyncs end up being slow. As Greg
has pointed out during prior discussions of this, the normal scenario
when things get bad here is that there is no way in hell you're going
to fit the checkpoint into the originally planned time. Once all of
the write caches between PostgreSQL and the spinning rust are full,
the system is in trouble and things are going to suck. The hope is
that we can stop beating the horse while it is merely in intensive
care rather than continuing until the corpse is fully skeletonized.
Fixed delays don't work because - to push an already-overdone metaphor
a bit further - we have no idea how much of a beating the horse can
take; we need something adaptive so that we respond to what actually
happens rather than making predictions that will almost certainly be
wrong a large fraction of the time.
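
In scheduling terms, that works out to something like this
back-of-the-envelope sketch; the function and parameter names are
invented, not the actual checkpointer variables:

    /*
     * Illustrative only: reserve roughly 10% of the checkpoint budget
     * for the fsync phase.  The write phase aims to finish by this
     * target; if the fsyncs then run long, the checkpoint simply
     * overruns its target rather than being squeezed.
     */
    #define FSYNC_PHASE_FRACTION    0.10

    static double
    write_phase_target(double ckpt_start, double ckpt_interval,
                       double completion_target)
    {
        double  budget = ckpt_interval * completion_target;

        return ckpt_start + budget * (1.0 - FSYNC_PHASE_FRACTION);
    }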

To put this another way, when we start the fsync() phase, it often
consumes 100% of the available I/O on the machine, completely starving
every other process that might need any. This is certainly a
deficiency in the Linux I/O scheduler, but as they seem in no hurry to
fix it we'll have to cope with it as best we can. If you do the
fsyncs in fast succession (and 100ms separation might as well be no
separation at all), then the I/O starvation of the entire system
persists through the entire fsync phase. If, on the other hand, you
sleep for the same amount of time the previous fsync took, then on the
average, 50% of the machine's I/O capacity will be available for all
other system activity throughout the fsync phase, rather than 0%.
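
Roughly speaking, the adaptive behavior I'm describing looks like the
sketch below. Again, this is only an illustration of the idea, not
Kondo-san's patch; the helpers and the file list are made up.

    #include <time.h>
    #include <unistd.h>

    /* Illustrative helper: current time in seconds. */
    static double
    now_seconds(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Illustrative helper: sleep for a possibly multi-second interval. */
    static void
    sleep_seconds(double secs)
    {
        struct timespec ts;

        ts.tv_sec = (time_t) secs;
        ts.tv_nsec = (long) ((secs - (double) ts.tv_sec) * 1e9);
        nanosleep(&ts, NULL);
    }

    /*
     * Illustrative only: after each fsync, sleep for about as long as
     * that fsync took, so the rest of the system gets a comparable
     * window of I/O before we hit it again.
     */
    static void
    fsync_phase_adaptive(const int *pending_fds, int nfiles)
    {
        int     i;

        for (i = 0; i < nfiles; i++)
        {
            double  start = now_seconds();
            double  elapsed;

            (void) fsync(pending_fds[i]);
            elapsed = now_seconds() - start;

            if (i + 1 < nfiles)
                sleep_seconds(elapsed);
        }
    }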

Now, unfortunately, this is still not that good, because it's often
the case that all of the fsyncs except one are reasonably fast, and
there's one monster one that is very slow. ext3 has a known bad
behavior where an fsync dumps all dirty data for the entire
*filesystem*, which tends to create these kinds of effects. But even
on better-behaved filesystems, like ext4, it's fairly common to have one
fsync that is painfully longer than all the others. So even with
this patch, there are still going to be cases where the whole system
becomes unresponsive. I don't see any way to do better without a
better kernel API, or a better I/O scheduler, but that doesn't mean we
shouldn't do at least this much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
