Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-06-28 12:53:14
Message-ID: 51CD873A.5070705@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/06/28 0:08), Robert Haas wrote:
> On Tue, Jun 25, 2013 at 4:28 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
> I'm pretty sure Greg Smith tried the fixed-sleep thing before and
> it didn't work that well. I have also tried it and the resulting
> behavior was unimpressive. It makes checkpoints take a long time to
> complete even when there's very little data to flush out to the OS,
> which is annoying; and when things actually do get ugly, the sleeps
> aren't long enough to matter. See the timings Kondo-san posted
> downthread: 100ms delays aren't going to let the system recover in any
> useful way when the fsync can take 13 s for one file. On a system
> that's badly weighed down by I/O, the fsync times are often
> *extremely* long - 13 s is far from the worst you can see. You have
> to give the system a meaningful time to recover from that, allowing
> other processes to make meaningful progress before you hit it again,
> or system performance just goes down the tubes. Greg's test, IIRC,
> used 3 s sleeps rather than your proposal of 100 ms, but it still
> wasn't enough.
Yes. In the write phase, the checkpointer writes numerous 8KB dirty pages, one
per SyncOneBuffer() call, so a tiny (100ms) sleep can work well there. But in
the fsync phase, each fsync() flushes a whole relation file, and there are
scores of them, so a tiny sleep does not work well. It needs a longer sleep time
to let IO performance recover. If we want to find the best sleep time, we had
better base it on the previous fsync time. And if we want to prevent very long
fsync times, we had better make relation files smaller than the default maximum
size of 1GB.
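
As a rough illustration of basing the sleep on the previous fsync time, here is
a minimal sketch. It is not code from my patch: the function name
fsync_sleep_ms() and its parameters (ratio, min_ms, max_ms) are hypothetical,
chosen only for this example.

```c
/*
 * Hypothetical helper (not from the patch): choose how long to sleep
 * after an fsync() from how long that fsync() took.
 *
 * prev_fsync_ms : measured duration of the previous fsync(), in ms
 * ratio         : assumed tuning factor, e.g. 1.0 = sleep as long as
 *                 the fsync itself took
 * min_ms/max_ms : assumed bounds, so that fast fsyncs do not stretch
 *                 the checkpoint and slow ones do not stall it forever
 */
static long
fsync_sleep_ms(long prev_fsync_ms, double ratio, long min_ms, long max_ms)
{
    long sleep_ms = (long) (prev_fsync_ms * ratio);

    if (sleep_ms < min_ms)
        sleep_ms = min_ms;
    if (sleep_ms > max_ms)
        sleep_ms = max_ms;
    return sleep_ms;
}
```

With ratio = 1.0 and bounds of 100ms and 30s, a 50ms fsync would be followed by
the minimum 100ms sleep, while the 13 s fsync mentioned above would be followed
by a 13 s sleep.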

Back to the subject. Here are the test results for our patches. The fsync + write
patch did not give a good result in the previous test, so I re-ran the benchmark
under the same conditions. It seems to perform better than in the previous result.

* Performance result in DBT-2 (WH340)
               | TPS      90%tile    Average  Maximum
---------------+---------------------------------------
original_0.7   | 3474.62  18.348328  5.739    36.977713
original_1.0   | 3469.03  18.637865  5.842    41.754421
fsync          | 3525.03  13.872711  5.382    28.062947
write          | 3465.96  19.653667  5.804    40.664066
fsync + write  | 3586.85  14.459486  4.960    27.266958
Heikki's patch | 3504.3   19.731743  5.761    38.33814

* HTML result in DBT-2
http://pgstatsinfo.projects.pgfoundry.org/RESULT/

In the attached text, I also describe the time taken by each checkpoint. With the
fsync patch, checkpoints seemed to take longer than without it. However, the
checkpoints still finished on schedule, within checkpoint_timeout and the
allowable time. I think the most important thing in the fsync phase is not that
the checkpoint finishes fast, but that the pages are definitely and reliably
written by the end of the checkpoint. So my fsync patch is not working
incorrectly.

My write patch still seems to have a lot of unexplained behavior, so I will try
to investigate objective results and the theory behind its effect.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachment Content-Type Size
result_DBT-2.txt text/plain 10.7 KB
