Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: james <james(at)mansionfamily(dot)plus(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-07-14 21:28:50
Message-ID: 51E31812.1030908@mansionfamily.plus.com
Lists: pgsql-hackers

On 14/07/2013 20:13, Greg Smith wrote:
> The most efficient way to write things out is to delay those writes as
> long as possible.

That doesn't smell right to me. It might be that delaying allows more
combining and allows the kernel to see more at once and optimise it, but
I think the counter-argument is that it is an efficiency loss to have
either CPU or disk idle waiting on the other. It cannot make sense from
a throughput point of view to have disks doing nothing and then become
overloaded so they are a bottleneck (primarily seeking) and the CPU does
nothing.

Now I have NOT measured behaviour but I'd observe that we see disks that
can stream 100MB/s but do only 5% of that if they are doing random IO.
Some random seeks during sync can't be helped, but if they are done when
we aren't waiting for sync completion then they are in effect free. The
flip side is that we can't really know whether they will get merged with
adjacent writes later so it's hard to schedule them early. But we can
observe that if we have a bunch of writes to adjacent data then a seek
to do the write is effectively amortised across them.
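
To put rough numbers on that amortisation, here is a back-of-envelope sketch. It takes the 100MB/s streaming figure and the 5% random-IO figure from above, and assumes 8kB blocks (the default PostgreSQL page size) and a simple model in which each run of adjacent blocks costs one seek plus streaming transfer time. The names and the model are illustrative assumptions only, not measurements.

```python
# Back-of-envelope model: one seek per run of adjacent blocks, then streaming.
STREAM_RATE = 100e6               # bytes/s, sequential figure quoted above
RANDOM_RATE = 0.05 * STREAM_RATE  # bytes/s, "only 5% of that" for random IO
BLOCK_SIZE = 8192                 # bytes, assumed (default PostgreSQL page size)

# Seek cost implied by the two rates for a single isolated block:
#   BLOCK_SIZE / RANDOM_RATE = seek + BLOCK_SIZE / STREAM_RATE
SEEK_TIME = BLOCK_SIZE / RANDOM_RATE - BLOCK_SIZE / STREAM_RATE


def effective_rate(run_length):
    """Bytes/s achieved when every seek is followed by run_length adjacent blocks."""
    payload = run_length * BLOCK_SIZE
    return payload / (SEEK_TIME + payload / STREAM_RATE)


if __name__ == "__main__":
    for n in (1, 4, 16, 64, 256):
        print(f"run of {n:3d} blocks: {effective_rate(n) / 1e6:6.1f} MB/s")
```

Under this model a single isolated block reproduces the 5% figure by construction, and the rate climbs towards the streaming rate as runs get longer, which is the sense in which one seek is shared across the whole run.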

So it occurs to me that perhaps we can watch for patterns where we have
groups of adjacent writes that might stream, and when they form we might
schedule them to be pushed out early (if not immediately), ideally out
as far as the drive (but not flushed from its cache) and without forcing
all other data to be flushed too. And perhaps we should always look to
be getting drives dedicated to dbms to do something, even if it turns
out to have been redundant in the end.
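
As a minimal sketch of the "watch for groups of adjacent writes" part, assuming dirty pages can be identified by block number within a file: the function below groups block numbers into runs of consecutive blocks and reports only runs long enough to be worth pushing out early. The function name and the min_run threshold are hypothetical, and how the writes would actually be issued to the drive is left out entirely.

```python
def find_adjacent_runs(dirty_blocks, min_run=4):
    """Group dirty block numbers into runs of consecutive blocks.

    Returns a list of (start_block, length) pairs, in ascending order,
    for every run that is at least min_run blocks long. min_run is a
    hypothetical tuning knob, not a value taken from any measurement.
    """
    runs = []
    start = prev = None
    for block in sorted(set(dirty_blocks)):
        if prev is not None and block == prev + 1:
            prev = block          # still inside the current run
            continue
        # The current run (if any) has ended; keep it if it is long enough.
        if start is not None and prev - start + 1 >= min_run:
            runs.append((start, prev - start + 1))
        start = prev = block      # begin a new run at this block
    if start is not None and prev - start + 1 >= min_run:
        runs.append((start, prev - start + 1))
    return runs


if __name__ == "__main__":
    dirty = [7, 3, 4, 5, 6, 20, 40, 41]
    print(find_adjacent_runs(dirty, min_run=4))
```

Isolated blocks and short runs are left for the normal sync path, since only the longer runs are likely to stream.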

That's not necessarily easy on Linux without using direct unbuffered
IO, but to me that is Linux's problem. For a start it's not the only
target system, and having 'we need' feedback from db and mail system
groups to the NT kernel devs hasn't hurt, and it never hurt Solaris to
hear what Oracle and Sybase devs felt they needed either.
