Re: Design proposal: fsync absorb linear slider

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-08-27 10:26:30
Message-ID: 521C7ED6.3010108@2ndQuadrant.com
Lists: pgsql-hackers

On 7/29/13 2:04 AM, KONDO Mitsumasa wrote:
> I think that it is almost same as small dirty_background_ratio or
> dirty_background_bytes.

The main difference here is that all writes pushed out this way will be
to a single 1GB relation segment. The odds are better that multiple
writes will combine, and that the I/O will involve a lower-than-average
amount of random seeking. Shrinking the size of the write cache, by
contrast, always results in more random seeking.

> The essential improvement is not dirty page size in fsync() but
> scheduling of fsync phase.
> I can't understand why postgres does not consider scheduling of fsync
> phase.

Because it cannot deliver the sort of latency improvements I think people
want. I proved that to myself during the last 9.2 CF, when I submitted
several fsync scheduling changes.

By the time you get to the checkpoint's sync phase, a system that's
always writing heavily has far too much backlog to cope with. There just
isn't enough time left before the checkpoint should end to write
everything out. You have to force writes to actual disk to start
happening earlier to keep a predictable schedule. Basically, the
longer you go without issuing an fsync, the more uncertainty there is
around how long it might take to finish. My proposal lets someone keep
all I/O from ever reaching the point where that uncertainty is high.

In the simplest to explain case, imagine that a checkpoint includes a
1GB relation segment that is completely dirty in shared_buffers. When a
checkpoint hits this, it will have 1GB of I/O to push out.

If you have waited this long to fsync the segment, the problem is now
too big to fix by checkpoint time. Even if the 1GB of writes are
themselves nicely ordered and grouped on disk, the concurrent background
write activity is going to chop the combination up into more random I/O
than the ideal.

Regular consumer disks have a worst-case random I/O throughput of less
than 2MB/s. My observed progress rates for such systems show you're
lucky to get 10MB/s of writes out of them. So how long will the dirty
1GB in the segment take to write? 1GB @ 10MB/s = 102.4 *seconds*. And
that's exactly what I saw whenever I tried to play with checkpoint sync
scheduling. No matter what you do there, periodically you'll hit a
segment that has over a minute of dirty data accumulated, and >60 second
latency pauses result. By the time you've reached the checkpoint, you're
already dead when you call fsync on that relation. You *must* hit that
segment with fsync more often than once per checkpoint to achieve
reasonable latency.
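
To put rough numbers on how the stall shrinks as you sync more often, here
is the same arithmetic spelled out as a trivial C program. The 10MB/s
figure is just the observed rate quoted above, not a guarantee, and the
smaller sizes are hypothetical per-segment caps:

#include <stdio.h>

int
main(void)
{
    const double write_rate_mb_s = 10.0;                /* observed rate, not a guarantee */
    const double dirty_mb[] = {1024.0, 512.0, 256.0};   /* whole segment vs. capped backlog */
    int          i;

    for (i = 0; i < 3; i++)
        printf("%6.0fMB dirty -> ~%5.1f second sync stall\n",
               dirty_mb[i], dirty_mb[i] / write_rate_mb_s);
    return 0;
}

The fully dirty segment is the 102.4 second stall described above; capping
how much can accumulate per segment shrinks the worst-case stall
proportionally.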

With this "linear slider" idea, I might tune such that no segment will
ever get more than 256MB of writes before hitting a fsync instead. I
can't guarantee that will work usefully, but the shape of the idea seems
to match the problem.
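
As a rough illustration of the mechanism I have in mind, here is a minimal
sketch. It is not PostgreSQL code; the names are made up, and the 256MB cap
stands in for wherever the slider GUC happens to be set:

#include <stddef.h>
#include <unistd.h>

#define BLCKSZ              8192
#define EARLY_SYNC_LIMIT    (256 * 1024 * 1024)   /* hypothetical slider position */

typedef struct SegmentWriteState
{
    int     fd;                 /* open descriptor for one 1GB relation segment */
    size_t  bytes_since_sync;   /* writes pushed out since the last fsync */
} SegmentWriteState;

/* Called after each block written out to this segment. */
static void
note_segment_write(SegmentWriteState *seg)
{
    seg->bytes_since_sync += BLCKSZ;

    if (seg->bytes_since_sync >= EARLY_SYNC_LIMIT)
    {
        /*
         * Absorb the backlog now, while it is still bounded: 256MB at the
         * 10MB/s rate above is ~25 seconds worst case, versus >100 seconds
         * for a fully dirty segment left for the checkpoint's sync phase.
         */
        fsync(seg->fd);
        seg->bytes_since_sync = 0;
    }
}

The point of making it a slider is that one counter covers the whole range:
a cap of 1GB degenerates into today's behavior of one fsync per segment per
checkpoint, while smaller caps trade more frequent, smaller syncs for a
bounded worst-case stall.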

> Taken together my checkpoint proposal method,
> * write phase
> - Almost same, but considering fsync phase schedule.
> - Considering case of background-write in OS, sort buffer before
> starting checkpoint write.

This cannot work, for the reasons I've outlined here. I guarantee you I
will easily find a test workload where it performs worse than what's
happening right now. If you want to play with this to learn more about
the trade-offs involved, that's fine, but expect me to vote against
accepting any change of this form. I would prefer you not submit one,
because it will waste a large amount of reviewer time to reach that
conclusion yet again. And I'm not going to be that reviewer.

> * fsync phase
> - Considering checkpoint schedule and write-phase schedule
> - Executing separated sync_file_range() and sleep, in final fsync().

If you can figure out how to use sync_file_range() to fine-tune how much
fsync work is happening at any time, that would be useful on all the
platforms that support it. I haven't tried it only because that looked
to me like a large job refactoring the entire fsync absorb mechanism,
and I've never had enough funding to take it on. That approach has a
lot of good properties, if it could be made to work without a lot of
code changes.
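
For what it's worth, the basic shape of such an approach on Linux might
look something like this sketch. Everything here is illustrative rather
than a worked-out design: the chunk size, the pause, and the function name
are all made up, and nothing in it reflects how the fsync absorb code is
actually structured today.

/* Spread out writeback of one segment before the final fsync(). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define CHUNK_BYTES (8 * 1024 * 1024)   /* hypothetical 8MB slices */
#define PAUSE_USEC  100000              /* hypothetical 100ms between slices */

int
spread_sync(int fd)
{
    struct stat st;
    off_t       offset;

    if (fstat(fd, &st) < 0)
        return -1;

    /*
     * Kick off writeback of each slice without waiting for it, then give
     * the disk a moment to drain before queueing the next one.
     */
    for (offset = 0; offset < st.st_size; offset += CHUNK_BYTES)
    {
        if (sync_file_range(fd, offset, CHUNK_BYTES,
                            SYNC_FILE_RANGE_WRITE) < 0)
            return -1;
        usleep(PAUSE_USEC);
    }

    /*
     * The final fsync() still has to happen, to cover metadata and any
     * pages dirtied after their slice was pushed, but by now most of the
     * data should already be on its way to disk.
     */
    return fsync(fd);
}

The attraction is exactly what's described above: SYNC_FILE_RANGE_WRITE
initiates writeback without waiting for it to complete, so the work can be
metered out across the checkpoint instead of arriving all at once when
fsync() is finally called.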

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
