From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-07-14 20:28:07
Message-ID: 51E309D7.5030604@2ndQuadrant.com
Lists: pgsql-hackers

On 7/3/13 9:39 AM, Andres Freund wrote:
> I wonder how much of this could be gained by doing a
> sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
> the original checkpoint-pass through the buffers or when fsyncing the
> files.

The fsync calls decompose into the queued set of block writes. If
they all need to go out eventually to finish a checkpoint, the most
efficient way from a throughput perspective is to dump them all at once.
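
For anyone who hasn't used it, the call Andres is suggesting looks
roughly like this. A minimal sketch with an invented helper name, not
actual checkpointer code:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /*
     * Hypothetical helper: after writing a run of checkpoint buffers,
     * ask the kernel to start writeback on just that byte range
     * without blocking for completion.
     */
    static void
    hint_range_writeback(int fd, off_t offset, off_t nbytes)
    {
    #ifdef SYNC_FILE_RANGE_WRITE
        /* Initiate async writeback of dirty pages in [offset, offset+nbytes) */
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    #endif
    }

The appeal is that the kernel can start writeback early while the
checkpoint keeps scanning, so less dirty data piles up behind the
final fsync.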

I'm not sure sync_file_range targeting checkpoint writes will turn out
any differently than block sorting. Let's say the database tries to get
involved in forcing a particular write order that way. Right now it's
going to be making that ordering decision without the benefit of also
knowing what blocks are being read. That makes it hard to do better
than the OS, which knows a different--and potentially more useful in a
read-heavy environment--set of information about all the pending I/O.
And it would be very expensive to make all the backends start sharing
information about what they read just to pull that logic into the
database. It's really easy to wander down the path where you assume you
must know more than the OS does, which leads to things like direct I/O.
I am skeptical of that path in general. I really don't want Postgres
to be competing with the innovation rate in Linux kernel I/O if we can
ride it instead.

One idea I was thinking about that overlaps with a sync_file_range
refactoring is simply tracking how many blocks have been written to each
relation. If there were a rule like "fsync any relation that's gotten
more than 100 8K writes", we'd never build up the sort of backlog that
causes the worst latency issues. You really need to start tracking the
file range there, just to fairly account for multiple writes to the same
block. One of the reasons I don't mind all the work I'm planning to put
into block write statistics is that I think that will make it easier to
build this sort of facility too. The original page write and the fsync
call that eventually flushes it out are very disconnected right now, and
file range data seems the right missing piece to connect them well.
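
To make that concrete, here's a minimal sketch of the "100 8K writes"
rule. All the names are invented for illustration, and a real version
would track file ranges as discussed above rather than a bare counter:

    #include <unistd.h>

    #define FLUSH_AFTER_WRITES 100      /* the "100 8K writes" rule */

    /* Invented structure: per-relation write accounting */
    typedef struct RelWriteStats
    {
        int     fd;                     /* relation's open file descriptor */
        int     writes_since_flush;     /* 8K block writes since last flush */
    } RelWriteStats;

    /* Call on every block write; flush early once the backlog is big enough */
    static void
    count_block_write(RelWriteStats *rel)
    {
        if (++rel->writes_since_flush >= FLUSH_AFTER_WRITES)
        {
            /* push accumulated dirty data out before the backlog grows;
             * could equally be sync_file_range on the dirty span */
            (void) fsync(rel->fd);
            rel->writes_since_flush = 0;
        }
    }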

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
